Mathematical processes for determination of peptidase cleavage

ABSTRACT

This invention relates to the identification of peptidase cleavage sites in proteins and in particular to the identification of protease cleavage by the endopeptidases. The present invention utilizes a bioinformatic methodology for prediction of peptidase cleavage sites based on principal component analysis and on training sets obtained by experimental protein cleavage. This invention is not limited to training sets derived from CSL approaches, nor to any other experimental determination of cleavage sites.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 14/895,748, filed Dec. 3, 2015, allowed as U.S. Pat. No. 11,069,427, which is a Section 371 U.S. national stage entry of International Patent Application No. PCT/US2014/041525, International Filing Date Jun. 9, 2014, which claims the benefit of U.S. Prov. Pat. Appl. 61/833,256, filed Jun. 10, 2013, each of which is incorporated by reference herein in its entirety.

INCORPORATION-BY-REFERENCE OF MATERIAL SUBMITTED ELECTRONICALLY

Incorporated by reference in its entirety herein is a computer-readable nucleotide/amino acid sequence listing submitted concurrently herewith and identified as follows: One 5,899,000 Byte ASCII (Text) file named “32736-304_ST25” created on Jul. 15, 2021.

FIELD OF THE INVENTION

This invention relates to the identification of peptidase cleavage sites in proteins and in particular to the identification of protease cleavage by the endopeptidases.

BACKGROUND OF THE INVENTION

Peptidases perform a wide variety of key functions in biology, ranging from digestion of dietary protein to control of physiologic processes and function of the immune response. Peptidases are present in all life forms including animals and plants, eukaryotes, prokaryotes and viruses. Two broad categories are the exopeptidases and the endopeptidases. Exopeptidases cleave amino acids from the N terminal or C terminal ends of a protein or amino acid sequence. Endopeptidases cleave amino acid sequences internally.

Peptidases are also widely used in manufacturing, including but not limited to wood and leather processing, paper manufacturing, the detergent industry, textile manufacturing and the food industry. Proteolytic enzymes account for nearly 60% of the industrial enzyme market in the world. Industrial peptidases may be derived from natural sources or may be synthetic.

Peptidases play a key role in processing and presentation of peptides which stimulate each arm of the immune response. B cell epitopes are released by a peptidase. Proteins are cleaved by peptidases within the endosome to allow short peptides to be released, which are then transported to the surface bound by MHC molecules and presented to T cell receptors to stimulate a T cell response. This can result in up-regulation of cytotoxic T cells or T helper cells needed to aid the antibody response, and stimulation of regulatory T cell responses which result in down-regulation of the cellular immune response. The position of the cleavage site is critical in determining which peptides are released and hence whether the peptides released have a high binding affinity to MHC molecules, facilitating their traffic and presentation.

While MHC molecules differ between individuals of differing genetics, causing a wide variety of immune responses, the peptidase function appears to be relatively constant at least within a particular isotype of a peptidase. Antigen presenting cells such as dendritic cells have a variety of peptidases which contribute to cleavage of proteins and lead to T cell epitope presentation. These enzymes include but are not limited to the cathepsins, including cathepsin L, cathepsin B, and cathepsin S. There are species differences between the proteolytic enzymes in dendritic cells and, although similar names have been affixed to apparently comparable enzymes based on historical studies, study of their sequences indicates that proteins of similar name designation are not necessarily orthologous.

Protease inhibitors are an important class of medicinal products. Peptidase inhibitors are used to intervene in viral replication in a number of diseases such as HIV/AIDS, HCV and others. Protease inhibitors have also been developed to function as metabolic inhibitors in type 2 diabetes and other diseases.

It has been a long sought-after goal to be able to predict endopeptidase cleavage. A few endopeptidases have more easily characterizable cleavage sites, such as trypsin, which cleaves the bond at the C terminal side of a lysine or an arginine amino acid.
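By way of illustration only, the simple trypsin rule stated above can be sketched in a few lines of Python. The function names are hypothetical and the sketch encodes only the rule as stated (cleavage after every lysine or arginine); it deliberately ignores any sequence-context effects, which is exactly the kind of simplification that does not carry over to the more complex peptidases discussed below.

```python
def trypsin_cleavage_bonds(sequence):
    """Return 1-based positions i where the bond between residue i and i+1
    follows a lysine (K) or arginine (R). Simplified rule only."""
    return [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in "KR"]


def digest(sequence):
    """Split a sequence into fragments at the bonds found above."""
    fragments, start = [], 0
    for bond in trypsin_cleavage_bonds(sequence):
        fragments.append(sequence[start:bond])
        start = bond
    fragments.append(sequence[start:])
    return fragments


print(trypsin_cleavage_bonds("AAKGGRTT"))
print(digest("AAKGGRTT"))
```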

Determinations of cleavage sites have been catalogued, for instance in databases such as MEROPS. See, e.g., Rawlings, N. D., Barrett, A. J. & Bateman, A. (2012) MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res 40, D343-D350. The effort to classify endopeptidase cleavage sites has often been based on limited experimental data. Peptidases which provide very specific cleavage of proteins, for instance cathepsins that are active in antigen presentation, have more complex cleavage sites that have challenged researchers.

A recent addition to techniques available for identifying peptidase cleavage sites includes cleavage site labeling (CSL) techniques. See, e.g., Impens F, Colaert N, Helsens K, Ghesquiere B, Timmerman E, et al. (2010) A quantitative proteomics design for systematic identification of protease cleavage events. Mol Cell Proteomics 9: 2327-2333; Tholen S, Biniossek M L, Gessler A L, Muller S, Weisser J, et al. (2011) Contribution of cathepsin L to secretome composition and cleavage pattern of mouse embryonic fibroblasts. Biol Chem 392: 961-971; Biniossek M L, Nagler D K, Becker-Pauly C, Schilling O (2011) Proteomic identification of protease cleavage sites characterizes prime and non-prime specificity of cysteine cathepsins B, L, and S. J Proteome Res 10: 5363-5373. CSL is an accurate but expensive and time-consuming process requiring a complex experimental setup with expensive equipment. Each protein must be examined individually for each enzyme.

It would be of great utility to be able to predict, from the experimental results obtained with one protein, how a given enzyme may cleave other proteins, so that one or a few experiments using CSL or another experimental approach can provide the basis of generalizable predictions for a given peptidase which can be applied to any protein of interest, whether natural or synthetic, to predict its cleavage.

SUMMARY OF THE INVENTION

The present invention provides a mathematical methodology for prediction of peptidase cleavage sites based on principal component analysis and on training sets obtained by experimental protein cleavage. This invention is not limited to training sets derived from CSL approaches, nor to any other experimental determination of cleavage sites. It is also anticipated that there will be new approaches developed for experimental measurement of cleavage sites and these too may be the source of training sets for the present invention.

Accordingly, the present invention is directed to a method for in silico prediction of peptide dimers, termed cleavage site dimers, which are located centrally within a longer peptide, most commonly but not limited to an octomer, in which the dimer spans a scissile bond and which has a high probability of cleavage by specific peptidases. In some embodiments the peptides that are cleaved may be exogenous to the host (i.e., pathogens, allergens, or synthetic protein biopharmaceuticals); in other embodiments they are endogenous (as in autoimmunity or tumor associated antigens).

In some embodiments the present invention provides processes, preferably computer implemented, for the derivation of ensembles of equations for the prediction of peptidase cleavage. In some embodiments the process comprises generating mathematical expressions based on multiple uncorrelated physical parameters of amino acids, wherein said mathematical expressions serve as descriptors of a peptide. The peptide descriptor is then applied to a set of peptides for which the cleavage site and probability have been experimentally determined. The mathematical descriptors and the experimental data are then compared and a prediction equation derived for cleavage of the scissile bond between each possible pair of amino acids located in the cleavage site dimer positions. In some embodiments the process is then repeated to derive an equation for each possible pair of amino acids in a cleavage site dimer. The assemblage of equations for every possible cleavage site dimer then constitutes an ensemble of predictive equations.

In some embodiments the mathematical expression which is a peptide descriptor is derived by analyzing more than one uncorrelated physical parameter of an amino acid via a computer processor, and constructing a correlation matrix of said physical parameters. In some embodiments this permits the derivation of multiple mutually orthogonal or uncorrelated proxies, wherein said proxies are weighted and ranked to provide descriptors of the amino acids. In further embodiments a number of the proxies which contribute most to the description of the amino acid variability are then selected to serve as descriptors. In some embodiments this number of proxies may be three or more. In some embodiments the proxies are principal components. In further embodiments, by combining the mathematical expressions comprising several proxies describing each amino acid in a peptide, a mathematical descriptor for the peptide is derived.
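By way of illustration only, the derivation of amino acid and peptide descriptors described above may be sketched as follows, assuming NumPy is available. The property table here is a seeded random placeholder, not measured physicochemical data, and the names `principal_component_descriptors` and `peptide_descriptor` are hypothetical. The sketch standardizes the properties, builds their correlation matrix, ranks the resulting principal components by eigenvalue, and keeps the first three as the per-residue descriptor.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def principal_component_descriptors(properties, n_components=3):
    """properties: array of shape (n_amino_acids, n_properties)."""
    # Standardize each property so the analysis runs on correlations.
    z = (properties - properties.mean(axis=0)) / properties.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # rank components, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs[:, :n_components]    # mutually uncorrelated proxies
    explained = eigvals / eigvals.sum()       # share of variance per component
    return scores, explained


# Placeholder table: 20 amino acids x 6 made-up properties (NOT real data).
rng = np.random.default_rng(0)
placeholder_properties = rng.normal(size=(20, 6))
scores, explained = principal_component_descriptors(placeholder_properties)
descriptors = {aa: scores[i] for i, aa in enumerate(AMINO_ACIDS)}


def peptide_descriptor(peptide):
    """Concatenate per-residue descriptors, e.g. 8 residues x 3 = 24 values."""
    return np.concatenate([descriptors[aa] for aa in peptide])


print(peptide_descriptor("ACDEFGHI").shape)
```

With a real property table in place of the placeholder, the `explained` vector indicates how much of the amino acid variability the retained components capture.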

In some embodiments the computer assisted process of assembling an ensemble of equations to predict peptide cleavage requires first deriving a predictive equation for each cleavage site dimer pair of amino acids. By examining a set of peptides for which the cleavage is known, cleavage site dimers are identified which are comprised of identical amino acids but which are located in peptides that are either cleaved or uncleaved. A subset of peptides with these properties is randomly selected. By comparison of the cleaved and uncleaved peptides with the aforesaid peptide descriptors, a first equation is derived to predict cleavage for that cleavage site dimer. This process is repeated on a second random subset of the peptides and then repeated multiple times, each time expanding the ensemble of equations which can be polled and thus enhancing the precision of the prediction for the particular cleavage site dimer. In some embodiments the derivation of an ensemble of predictive equations is then conducted for other cleavage site dimer amino acid pairs until the maximum of 400 possible pairs has been examined, and the corresponding ensembles of predictive equations derived form an ensemble of predictive equations applicable to all potential cleavage site dimers.
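By way of illustration only, the bookkeeping behind this procedure may be sketched as follows. The sketch enumerates the 400 possible cleavage site dimers (20 × 20 amino acids), groups octomers by the dimer at P1-P1′, draws repeated random subsets for fitting, and polls the ensemble. All function names, the `fraction` parameter, and the use of a simple mean for polling are assumptions made for this example; the fitting of each predictive equation is left out.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALL_DIMERS = ["".join(pair) for pair in itertools.product(AMINO_ACIDS, repeat=2)]


def group_by_dimer(labeled_octomers):
    """labeled_octomers: iterable of (octomer, was_cleaved) pairs.
    The dimer is taken from the two central positions (P1 and P1')."""
    groups = {}
    for octomer, was_cleaved in labeled_octomers:
        dimer = octomer[3:5]
        bucket = groups.setdefault(dimer, {"cleaved": [], "uncleaved": []})
        bucket["cleaved" if was_cleaved else "uncleaved"].append(octomer)
    return groups


def draw_random_subsets(cleaved, uncleaved, n_members, fraction=0.8, seed=0):
    """One (cleaved, uncleaved) random subset per ensemble member."""
    rng = random.Random(seed)
    k_c = min(len(cleaved), max(1, int(fraction * len(cleaved))))
    k_u = min(len(uncleaved), max(1, int(fraction * len(uncleaved))))
    return [(rng.sample(cleaved, k_c), rng.sample(uncleaved, k_u))
            for _ in range(n_members)]


def poll(member_probabilities):
    """Combine the ensemble members' outputs (assumed here: simple mean)."""
    return sum(member_probabilities) / len(member_probabilities)


print(len(ALL_DIMERS))
```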

In some embodiments the predictive equations are then applied to a protein of interest, wherein the invention provides for inputting the protein of interest into a computer and applying amino acid descriptors based on multiple uncorrelated physical parameters to provide a peptide descriptor for each peptide comprised of a subset of amino acids from within the protein of interest. In some embodiments the process then further comprises applying the peptidase prediction equation ensemble to predict the cleavage site dimers in the peptides from the protein of interest and the probability of cleavage of each cleavage site dimer. In some particular embodiments the probability of cleavage may be 60% or 70% or 80% or 90% or higher.
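By way of illustration only, scanning a protein of interest can be sketched as a sliding window of eight residues, with each window scored for cleavage at its central bond. The function names and the `probability_fn` argument are hypothetical; the toy scoring function at the end is a stand-in used only to exercise the scan and is not the predictor of the invention.

```python
def cleavage_site_octomers(protein, window=8):
    """Yield (start, p1_position, octomer) with 1-based positions.
    The scored bond lies between residues p1_position and p1_position + 1."""
    half = window // 2
    for start in range(len(protein) - window + 1):
        yield start + 1, start + half, protein[start:start + window]


def predict_sites(protein, probability_fn, threshold=0.6):
    """Return (p1_position, dimer, probability) for windows at or above threshold."""
    hits = []
    for _, p1_position, octomer in cleavage_site_octomers(protein):
        probability = probability_fn(octomer)
        if probability >= threshold:
            hits.append((p1_position, octomer[3:5], probability))
    return hits


def toy_probability(octomer):
    # Stand-in scorer for demonstration only: favors K or R at P1.
    return 1.0 if octomer[3] in "KR" else 0.0


print(predict_sites("AAAKGGGGAA", toy_probability))
```

The `threshold` argument corresponds to the probability cut-offs mentioned above (for example 0.6, 0.7, 0.8 or 0.9).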

In some embodiments the peptidase is an endopeptidase. In some embodiments the peptidase is a serine peptidase; in other embodiments the peptidase is a cysteine peptidase, an aspartic peptidase, a glutamic peptidase, an asparagine peptidase, a threonine peptidase, or a metallopeptidase.

In some embodiments the peptidase is located within the endosome of a cell. In some instances said cell is an antigen presenting cell. In some embodiments the cell is a dendritic cell, in other embodiments it is a B cell, and in yet further embodiments it is a macrophage.

The present invention provides for methods to predict cleavage by peptidases, including mammalian peptidases such as human and mouse peptidases. These examples are not limiting and the invention described is equally applicable to analysis of cleavages by peptidases of other eukaryotic or prokaryotic species or to synthetically engineered peptidases upon provision of relevant training sets for that enzyme. The training sets may be derived from many experimental techniques known to those skilled in the art, including but not limited to cleavage site labeling.

In some instances, the peptide which defines the context of the cleavage site is an octomer, designated the cleavage site octomer (CSO). In other instances the peptide which defines the context of a cleavage site may comprise fewer or greater numbers of amino acids. Based on the work of Schechter & Berger to describe the specificity of papain and on crystallographic structures of peptidases, the active site of a peptidase enzyme is commonly located in a groove on the surface of the molecule between adjacent structural domains, and the peptide substrate specificity is dictated by the properties of binding sites arranged along the groove on one or both sides of the catalytic site that is responsible for hydrolysis of the scissile bond. See, e.g., Schechter, I. & Berger, A. On the size of the active site in proteases. I. Papain. Biochem. Biophys. Res. Commun. 27, 157-162 (1967). Accordingly, the specificity of a peptidase is described by use of a conceptual model in which each specificity subsite is able to accommodate the sidechain of a single amino acid residue. The sites are numbered from the catalytic site, S1, S2 . . . Sn towards the N-terminus of the substrate, and S1′, S2′ . . . Sn′ towards the C-terminus. The residues they accommodate are numbered P1, P2 . . . Pn, and P1′, P2′ . . . Pn′, respectively, as follows:

    Substrate: -P4-P3-P2-P1˜P1′-P2′-P3′-P4′-
    Enzyme:    -S4-S3-S2-S1*S1′-S2′-S3′-S4′-

In some embodiments the scissile bond is located between two amino acids P1 and P1′ designated as the cleavage site dimer.
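By way of illustration only, the Schechter & Berger numbering can be attached to an octomer as follows. The function names are hypothetical, and an ASCII apostrophe stands in for the prime symbol in the position labels.

```python
def schechter_berger_labels(n_nonprime=4, n_prime=4):
    """Position labels from the N-terminal side to the C-terminal side."""
    nonprime = [f"P{i}" for i in range(n_nonprime, 0, -1)]   # P4 ... P1
    prime = [f"P{i}'" for i in range(1, n_prime + 1)]        # P1' ... P4'
    return nonprime + prime


def annotate_octomer(octomer):
    """Map each position label to the residue occupying it."""
    labels = schechter_berger_labels()
    if len(octomer) != len(labels):
        raise ValueError("expected a peptide of 8 residues")
    return dict(zip(labels, octomer))


def cleavage_site_dimer(octomer):
    """The two residues flanking the scissile bond (P1 and P1')."""
    positions = annotate_octomer(octomer)
    return positions["P1"] + positions["P1'"]


print(annotate_octomer("ACDEFGHI"))
print(cleavage_site_dimer("ACDEFGHI"))
```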

In some embodiments, the analyzing of physical parameters of subsets of amino acids comprises replacing alphabetical coding of individual amino acids in the subset with mathematical expressions. In some embodiments, the physical properties or physical parameters are represented by one or more principal components. In some embodiments, the physical parameters are represented by at least three principal components or 3, 4, 5, or 6 principal components. In some embodiments, the letter code for each amino acid in the subset is transformed to at least one mathematical expression. In some embodiments, the mathematical expression is derived from principal component analysis of amino acid physical properties. In some embodiments, the letter code for each amino acid in the subset is transformed to a three number representation. In some embodiments said principal components are mutually uncorrelated, or orthogonal. In some embodiments, the principal components are weighted and ranked proxies for the physical properties of the amino acids in the subset. In some embodiments, the physical properties are selected from the group consisting of polarity, optimized matching hydrophobicity, hydropathicity, hydropathicity expressed as free energy of transfer to surface in kcal/mole, hydrophobicity scale based on free energy of transfer in kcal/mole, hydrophobicity expressed as ΔG ½ cal, hydrophobicity scale derived from 3D data, hydrophobicity scale represented as π-r, molar fraction of buried residues, proportion of residues 95% buried, free energy of transfer from inside to outside of a globular protein, hydration potential in kcal/mol, membrane buried helix parameter, mean fractional area loss, average area buried on transfer from standard state to folded protein, molar fraction of accessible residues, hydrophilicity, normalized consensus hydrophobicity scale, average surrounding hydrophobicity, hydrophobicity of physiological L-amino acids, hydrophobicity scale represented as (π−r)², retention coefficient in HFBA, retention coefficient in HPLC pH 2.1, hydrophobicity scale derived from HPLC peptide retention times, hydrophobicity indices at pH 7.5 determined by HPLC, retention coefficient in TFA, retention coefficient in HPLC pH 7.4, hydrophobicity indices at pH 3.4 determined by HPLC, mobilities of amino acids on chromatography paper, hydrophobic constants derived from HPLC peptide retention times, and combinations thereof. In some embodiments, the physical properties are predictive of the property of the peptides which comprise the cleavage site of a peptidase.

In some embodiments the process comprises application of a classifier. Examples of classifiers include but are not limited to neural nets and support vector machines. In preferred embodiments the classifier is a probabilistic classifier. In some embodiments, the processes comprise applying a classifier via the computer, wherein the classifier is used to predict the peptide which spans the scissile bond of a peptidase and the location of said cleavage site. In some embodiments, the classifier provides a quantitative structure activity relationship. In some embodiments, the first three principal components represent more than 80% of physical properties of an amino acid.

In some embodiments, the classifier is a neural network and the processes further comprise constructing a multi-layer perceptron neural network regression process wherein the output is the probability of cleavage at a given bond within a particular peptide or amino acid subset surrounding a peptidase cleavage site. In preferred embodiments said amino acid subset is an octomer. In some embodiments, the regression process produces a series of equations that allow prediction of the cleavage site using the physical properties of the subsets of amino acids. In some embodiments, the processes further comprise utilizing a number of hidden nodes in the multi-layer perceptron that correlates to the eight amino acids in the cleavage site octomer. In other embodiments a neural net of more or fewer hidden nodes may be used. In some embodiments, the neural network is validated with a training set of cleavage sites within peptides of known amino acid sequence. In some embodiments such training sets are derived from the experimental results of output from CSL procedures. See, e.g., Impens et al., Tholen et al., and Biniossek et al., referenced above. Other sources of training sets which provide experimental data on the site of peptidase cleavage may be applied and so the source of training set is not limiting. In yet other embodiments the classifier comprises application of a support vector machine. See, e.g., Cortes C, Vapnik V (1995) Support-vector network. Mach Learn 20: 1-25; Scholkopf B, Smola A J, Williamson R C, Bartlett P L (2000) New support vector algorithms. Neural Comput 12: 1207-1245; Bennett K P, Campbell C (2000) Support vector machines: Hype or Hallelujah? SIGKDD Explorations 2. Other types of classifiers may be applied.
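By way of illustration only, the forward pass of a multi-layer perceptron of the kind described above may be sketched as follows, assuming NumPy. The input is the 24-value octomer descriptor (8 positions × 3 principal components) and the hidden layer has 8 nodes with a hyperbolic tangent activation. The logistic squashing of the output into the range 0 to 1 is an assumption made for this example so that the output can be read as a probability. The weights are random and untrained, and the training procedure is not shown.

```python
import numpy as np

N_POSITIONS, N_COMPONENTS, N_HIDDEN = 8, 3, 8


def init_params(seed=0):
    """Random, untrained weights; biases start at zero."""
    rng = np.random.default_rng(seed)
    n_inputs = N_POSITIONS * N_COMPONENTS
    return {
        "W1": rng.normal(scale=0.1, size=(n_inputs, N_HIDDEN)),
        "b1": np.zeros(N_HIDDEN),
        "W2": rng.normal(scale=0.1, size=(N_HIDDEN, 1)),
        "b2": np.zeros(1),
    }


def forward(params, x):
    """x: array of shape (n_peptides, 24). Returns shape (n_peptides, 1)."""
    hidden = np.tanh(x @ params["W1"] + params["b1"])   # 8 hidden nodes
    logit = hidden @ params["W2"] + params["b2"]        # single output node
    return 1.0 / (1.0 + np.exp(-logit))                 # assumed output squashing


params = init_params()
batch = np.random.default_rng(1).normal(size=(5, N_POSITIONS * N_COMPONENTS))
print(forward(params, batch).ravel())
```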

In some embodiments, the amino acid sequence comprises the amino acid sequences of a class of proteins selected from the group derived from the proteome of pathogenic microorganisms. In other embodiments the amino acid sequences derive from a class of proteins selected from the group comprising allergens (including but not limited to plant allergen proteins and food allergens). In other embodiments the amino acid sequences derive from a class of proteins selected from the group comprising mammalian proteins including but not limited to tumor associated antigen proteins, proteins reactive in autoimmunity, immunoglobulins, enzymes and structural mammalian proteins. In other embodiments the amino acid sequences derive from a class of proteins selected from the group comprising synthetic and recombinantly manufactured proteins, including but not limited to biopharmaceuticals (e.g., replacement enzymes, clotting factors, monoclonal antibodies and antibody fusions) and industrial proteins (for example in food additives, textiles, wood). These examples however should not be considered limiting as the analytical approach can be applied to any peptidase of any species or source provided training sets can be developed experimentally by any means for that enzyme. Furthermore the predictions of cleavage by any selected peptidase may then be applied to proteins of any source.

In some embodiments, at least 80% of possible amino acid subsets within a protein are analyzed for predicted cleavage sites. In further embodiments all the proteins within an organism may be analyzed, with at least 80% of possible amino acid subsets in each protein being analyzed. In other embodiments all the proteins within a natural tissue (e.g., muscle) or within a composite industrial product (e.g., paper) may be analyzed to determine predicted cleavage sites.

In some embodiments determination of the cleavage site is used in selecting a peptide for inclusion in an immunogen or vaccine, such that the immunogen will be cleaved predictably for binding by an MHC molecule and presentation at the cell surface to a T cell receptor and hence stimulate immunity, or such that a B cell epitope is bound to a B cell receptor and internalized by the B cell. In some embodiments said immunogenic peptide is selected to provide a cleavage site 4, or 5 or 6 or up to 20 amino acids from the N terminal or the C terminal of the selected peptide. In some embodiments the peptide is selected such that the predicted cleavage site is separated from the immunogenic peptide by flanking regions of 1, or 2 or 3 or up to 20 amino acids.

In some embodiments an amino acid sequence is analyzed to determine the cleavage sites for one specific peptidase. In other embodiments the amino acid sequence is analyzed to predict the cleavage sites of 2 or 3 or 4 or more peptidase enzymes acting in sequence. In further embodiments the sequence of action of the peptidases may be varied.

In some embodiments, the present invention provides a computer system or computer readable medium comprising a neural network that determines peptidase cleavage sites within an amino acid sequence.

In some embodiments, the present invention provides a computer system configured to provide an output comprising a graphical representation of the location of the cleavage sites within an amino acid sequence, wherein the amino acid sequence forms one axis and the cleavage sites of one or more peptidases are charted against the amino acid sequence axis. In some embodiments, the present invention provides a synthetic polypeptide wherein amino acids flanking a high affinity MHC binding peptide have been substituted to change (e.g., increase or reduce) the probability of cathepsin cleavage. In some embodiments, the change is an increase in the probability of cleavage. In some embodiments, the change is a decrease in the probability of cleavage. In some embodiments, 2, 3, or 4 amino acids have been substituted to change the probability of cathepsin cleavage. In some embodiments, the amino acids which are substituted are located at the C terminal of a high affinity MHC binding peptide. In some embodiments, the amino acids which are substituted are located at the N terminal of a high affinity MHC binding peptide. In some embodiments, the probability of cathepsin cleavage is changed by at least 2 fold (e.g., increased or decreased by 100%). In some embodiments, the probability of cathepsin L cleavage is changed. In some embodiments, the probability of cathepsin S cleavage is changed. In some embodiments, the probability of cathepsin B cleavage is changed. In some embodiments, the peptide comprises one or more MHC binding peptides and a B cell binding peptide. In some embodiments, the peptide is an immunogen. In some embodiments the amino acids which are substituted lie within 1-10 amino acid positions of the N terminal of an MHC binding peptide; in yet other embodiments said amino acids which are substituted lie within 1-10 amino acids of the C terminal of an MHC binding peptide.

DESCRIPTION OF THE FIGURES

FIG. 1. Differences between the mean z-scale vector of cleaved and uncleaved peptides that are P1-anchored. Data are from a series of training set cohorts each consisting of N cleaved peptides and 4N random, uncleaved peptides with matching amino acids at the P1 position. The differential in the first principal component z₁ (polarity/hydrophobicity related) and the second principal component z₂ (size related) is defined as (Mean PC (cleaved set)−Mean PC (random set)). An individual colored line is plotted for each amino acid that is found in the P1 position in the active cleavage site octomer. Panel a) Human cathepsin L z₁-alanine anchor highlighted; b) Murine cathepsin L-alanine anchor highlighted; c) Human cathepsin L z₂-alanine anchor highlighted; d) Murine cathepsin L-alanine anchor highlighted. Alanine is selected to be highlighted only because it is the first in the alphabetic list.

FIG. 2A-G. Variation in the dispersion of the underlying principal components of the amino acids (z-scale vectors) is shown at each position in the cleavage site octomer (CSO) for different cathepsins under several experimental conditions. As in FIG. 1, each peptide set is anchored by matched amino acids at the P1 position in the CSO. The green diamond symbols plot the mean and the confidence limits of the standard deviations of the cleaved sets of peptides for each position in the CSO. The extremes of the boxes are the 25th and 75th quantile and the extremes of the red lines are the 10th and 90th percentile. The dashed line is plotted at the −2 sigma line (i.e., 95th percentile confidence) of the random, uncleaved cohort set and is constant at all positions for all proteases. a) P4; b) P3; c) P2; d) P1′; e) P2′; f) P3′; g) P4′. The solid line is the overall mean. Datasets used were from references indicated in the text. hCAT B=Human cathepsin B digested for 4 or 16 hrs; hCAT L=Human cathepsin L digested for 4 or 16 hrs; hCAT L16*=Human cathepsin L digested for 16 hrs using a different peptide library as substrate. mCAT D, E, L=murine cathepsins, results obtained as described in source references.

FIG. 3. Shape of the unit-integral weighting function used for regression fitting is shown for several different weighting factors. [1] (blue)=uniform {pattern=1, 1, 1, 1, 1, 1, 1, 1, Σ=8} principal component values used; [2] (red)={pattern=1, 2, 4, 8, 8, 4, 2, 1, Σ=30} multiplier applied; [3] (green)={pattern=1, 1.5, 2.25, 3.375, 3.375, 2.25, 1.5, 1, Σ=16.25} multiplier applied; [4] (purple)={pattern=1, 1.25, 1.5625, 1.953125, 1.953125, 1.5625, 1.25, 1, Σ=11.53125} multiplier applied to all three principal components used as predictors. All multipliers were normalized to the Σ to prevent scaling effects in the predictions.
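By way of illustration only, the four weighting patterns listed for FIG. 3 are each a geometric progression that rises towards the two central positions of the octomer and is mirrored about the scissile bond, with ratios of 1, 2, 1.5 and 1.25 respectively. The sketch below regenerates the patterns and their sums and normalizes them to unit total; the function names are hypothetical.

```python
def weighting_pattern(ratio, half_width=4):
    """Symmetric pattern: ratio**0 ... ratio**3, then mirrored."""
    rising = [ratio ** k for k in range(half_width)]
    return rising + rising[::-1]


def normalized(pattern):
    """Divide by the pattern sum so the weights total 1."""
    total = sum(pattern)
    return [w / total for w in pattern]


for ratio in (1, 2, 1.5, 1.25):
    pattern = weighting_pattern(ratio)
    print(ratio, pattern, sum(pattern))
```

Because each pattern is divided by its own sum, changing the ratio alters the relative emphasis on positions near the scissile bond without changing the overall scale of the predictors.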

FIG. 4A-D. Sensitivity and specificity patterns for cleavage predictions for human cathepsin L (16 hr time point) are shown in the layout of a standard contingency table or confusion matrix. a) TN=true negative; b) FP=false positive; c) FN=false negative; d) TP=true positive cleavage predictions. The inset ‘shadowgrams’ are a histogram-like graphic where a large number (all training and validation cohorts with all weighting factors) of partially transparent distribution kernels are overlaid and thereby simultaneously build up a pattern and density. The vertical axis is probability of that particular classification and the inset is the standard mean, median and quantiles of the underlying distributions. For the ‘heat diagrams’ the predictions are segregated by the P1 anchor residue and are centered (gray) on the overall mean of the four data columns. The ‘thermometers’ are scales of the associated probabilities. Each column represents the results obtained using a different weighting factor. Two-way clusters are formed by the method of Ward. See Ward, J. H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 48, 236-244 (1963).

FIG. 5A-E. Impact of using a dual amino acid anchor (P1-P1′) as compared to a single P1 anchor for prediction of murine cathepsin E. Description of heat diagrams as in FIG. 4. a) Sensitivity (TP/(TP+FN)) of the predictor output using a single amino acid anchor at P1. b) and d) Sensitivity using P1-P1′ pair anchors indicated. The numbers in parentheses are the number of cleavage events with the particular P1-P1′ pair. c) and e) FN/(TP+FN) as in b) and d).

FIG. 6. Predicted endosomal protease cleavage of myelin basic protein compared with experimental determinations. a) murine cathepsin L; b) murine cathepsin E; c) murine cathepsin D; d) human cathepsin B; e) human cathepsin S; f) human cathepsin L. Beck, H. et al. Cathepsin S and an asparagine-specific endoprotease dominate the proteolytic processing of human myelin basic protein in vitro. Eur. J. Immunol. 31, 3726-3736 (2001). The amino acid index in the full sequence (GI 17378805; P02686) is used. The predicted cleavages are consolidated for all predictors anchored at both P1 and P1′ amino acids as described in the text. The peptide sequence insets are at critical degradative cleavage positions (number) described by Beck et al. and are centered at the P1-P1′ cleavage point (˜) in the panel of the protease of relevance indicated in the reference.

FIG. 7. Artificial neural network perceptron topology. The perceptron used for prediction of cleavage has 3 layers: an input layer consisting of the vectors of the first three principal components of the amino acids in the octomer binding site; a hidden layer consisting of 8 nodes with symmetry to the octomer binding site; and a single output layer which is the cleavage prediction. A hyperbolic tangent activation function was used for all interconnections within the perceptron structure.

FIG. 8. Data flow.

FIG. 9A-D. Frequency pattern logos (produced by weblogo.berkeley.edu) of peptides with matching P1-P1′ in the CSO for two prevalent cleavage patterns in murine Cathepsins D and E. a) Murine Cathepsin D, 340 random uncleaved; b) Murine Cathepsin D, 34 cleaved peptides; c) Murine Cathepsin E, 660 random uncleaved; and d) Murine Cathepsin E, 66 cleaved. Indices at the bottom are the standard amino acid positions in the CSO.

FIG. 10A-D. Sensitivity and specificity patterns for human cathepsin B cleavage predictions are shown in the layout of a standard contingency table or confusion matrix. a) TN=true negative; b) FP=false positive; c) FN=false negative; d) TP=true positive cleavage predictions. The inset ‘shadowgrams’ are a histogram-like graphic where a large number (all training and validation cohorts with all weighting factors) of partially transparent distribution kernels are overlaid and thereby simultaneously build up a pattern and density. The vertical axis is probability of that particular classification and the inset is the standard mean, median and quantiles of the underlying distributions. For the ‘heat diagrams’ the predictions are segregated by the P1 anchor residue and are centered (gray) on the overall mean of the four data columns. The ‘thermometers’ are scales of the associated probabilities. Each column represents the results obtained using a different weighting factor. Two-way clusters are formed by the method of Ward.

FIG. 11. Sensitivity and specificity patterns for human cathepsin S cleavage predictions at pH 7.5 are shown in the layout of a standard contingency table or confusion matrix. a) TN=true negative; b) FP=false positive; c) FN=false negative; d) TP=true positive cleavage predictions. The inset ‘shadowgrams’ are a histogram-like graphic where a large number (all training and validation cohorts with all weighting factors) of partially transparent distribution kernels are overlaid and thereby simultaneously build up a pattern and density. The vertical axis is probability of that particular classification and the inset is the standard mean, median and quantiles of the underlying distributions. For the ‘heat diagrams’ the predictions are segregated by the P1 anchor residue and are centered (gray) on the overall mean of the four data columns. The ‘thermometers’ are scales of the associated probabilities. Each column represents the results obtained using a different weighting factor. Two-way clusters are formed by the method of Ward.

FIGS. 12A, 12B, and 12C. Comparison of performance of different predictors with different benchmark datasets. Downloaded from dtreg.com/benchmarks.htm.

FIG. 13A-D. Comparison of the performance of a probabilistic neural network (NN) and a support vector machine (SVM) as binary classifiers for predicting cleavage of human cathepsin L. The cleavage site octomers in the peptide training sets had either an alanine or a glycine at position P₁. (a) and (c) Glycine at P₁. Total of the cleaved trainer peptides was 222 (indicated by the blue horizontal line). Cleaved peptides were paired for training with 5 un-cleaved random cohorts with 888 peptides in each set (indicated by the red horizontal line). (b) and (d) Alanine at P₁. Total of the cleaved trainer peptides was 111 (blue horizontal line). Cleaved peptides were paired for training with 5 un-cleaved random cohorts with 444 peptides in each set (red horizontal line).

FIG. 14 provides principal components on the correlations of various physicochemical properties of amino acids from 31 different studies.

FIG. 15A-C. Predicted peptidase cleavage sites in a Brucella melitensis protein. A. Shows the population permuted plot of predicted MHC binding for Brucella melitensis methionine sulphoxide reductase B. The peptide marked in black (RYCINSASL (SEQ ID NO: 7)) was identified by Durward et al. (Durward, M. A., Harms, J., Magnani, D. M., Eskra, L., & Splitter, G. A. Discordant Brucella melitensis antigens yield cognate CD8+ T cells in vivo. Infect. Immun. 78, 168-176 (2010)) as capable of inducing a CD8+ cytotoxic response in mice. B. Shows the predicted probability of cleavage in this protein by human cathepsins L, S and B and murine cathepsins D, E and L. C. Is an expansion of a section of the plot shown in B. Each bar is indexed at the start of a cleavage site octomer and hence indicates the probability of cleavage at the bond 4 amino acids to the right (towards the C terminus). The darker colored bars show that there is a high probability of cleavage on either side of the peptide of interest located at positions 116-124.

FIG. 16. Peptidase cleavage sites predicted in AraH2 isoform 1 (GI_224747150). Noted below are the locations of the cleavages relative to peptides identified experimentally by Prickett et al. as dominant CD4+ epitopes.

FIG. 17. Statistical characteristics of the primary CLIP peptide in the context of a canonical 15-mer for 28 human MHC II alleles. The CLIP peptide binds to many different MHC II molecules with a moderate affinity of about e^6.26 ≈ 525 nM, equivalent to about −0.96σ (approximately 1σ below the mean).

FIG. 18. Statistical characteristics of the inverted CLIP peptide in the context of an inverted (non-canonical) 15-mer for 28 human MHC II alleles.

FIG. 19. Predicted MHC affinity for the CLIP peptide in either the canonical orientation or the reverse orientation in binding groove DR1 (DRB1*01:01).

FIG. 20A-B. Probability of cleavage by cathepsin B, S, L in HLA class II histocompatibility antigen isoform A. Panel A shows the probability of cathepsin cleavage along the whole protein. Panel B expands the detail for amino acid positions 90-120. Highlighted (darker color bars) cleavage points are high yield promiscuous self peptides reported by Chicz et al. 1993 (see Table 6 in reference) and shown in the inserted labels.

FIG. 21. Construct comprising an epitope peptide of interest at the N-terminus, the hinge region and the constant regions CH2 and CH3 from the murine IgG2a immunoglobulin. The molecule dimerizes via formation of disulphide bonds at the hinge.

FIG. 22. In vivo clearance of RL9-pulsed splenocytes in immunized mice. Mice were immunized and boosted once with affinity-purified RL-G2a(CH2-CH3) (P661) or RL-G2a(CH2-CH3)-RL (P662) or synthetic peptide RL9. One week after the boost, RL9-pulsed (labeled CFSEhi) and unpulsed (labeled CFSElo) splenocytes from naïve mice were adoptively transferred into the immunized mice via retrobulbar injection. Six hours post transfer, spleens were removed from immunized mice and analyzed for surviving pulsed, labeled target cells using flow cytometry. % specific lysis = (1 − [r_naive/r_vaccinated]) × 100, where r = % CFSElo cells ÷ % CFSEhi cells. P660=immunized with isotype-matching, irrelevant antibody.

FIG. 23A-B. Probability of cathepsin cleavage in methionine sulphoxide reductase B. Panel A shows the predicted cleavage of murine methionine sulphoxide reductase B. Panel B shows the probability of cleavage of Brucella melitensis methionine sulphoxide reductase B. In both panels the 9-mer peptide of interest RYCINSASL (SEQ ID NO: 7) is shown.

FIG. 24. Altered flanking regions of RL9 peptide.

DEFINITIONS

As used herein “peptidase” refers to an enzyme which cleaves a protein or peptide. The term peptidase may be used interchangeably with protease, proteinase, oligopeptidase, and proteolytic enzyme. Peptidases may be endopeptidases (endoproteases) or exopeptidases (exoproteases). Similarly, the term peptidase inhibitor may be used interchangeably with protease inhibitor or inhibitor of any of the other alternate terms for peptidase.

As used herein, the term “exopeptidase” refers to a peptidase that requires a free N-terminal amino group, C-terminal carboxyl group or both, and hydrolyses a bond not more than three residues from the terminus. The exopeptidases are further divided into aminopeptidases, carboxypeptidases, dipeptidyl-peptidases, peptidyl-dipeptidases, tripeptidyl-peptidases and dipeptidases.

As used herein, the term “endopeptidase” refers to a peptidase that hydrolyses internal, alpha-peptide bonds in a polypeptide chain, tending to act away from the N-terminus or C-terminus. Examples of endopeptidases are chymotrypsin, pepsin, papain and cathepsins. A very few endopeptidases act a fixed distance from one terminus of the substrate, an example being mitochondrial intermediate peptidase. Some endopeptidases act only on substrates smaller than proteins, and these are termed oligopeptidases. An example of an oligopeptidase is thimet oligopeptidase. Endopeptidases initiate the digestion of food proteins, generating new N- and C-termini that are substrates for the exopeptidases that complete the process. Endopeptidases also process proteins by limited proteolysis. Examples are the removal of signal peptides from secreted proteins (e.g., signal peptidase I) and the maturation of precursor proteins (e.g., enteropeptidase, furin). In the nomenclature of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB), endopeptidases are allocated to sub-subclasses EC 3.4.21, EC 3.4.22, EC 3.4.23, EC 3.4.24 and EC 3.4.25 for serine-, cysteine-, aspartic-, metallo- and threonine-type endopeptidases, respectively.

As used herein, a “cysteine peptidase” is a peptidase characterized by the presence of a cysteine at its active site, and a “serine peptidase” is a peptidase characterized by a serine at its active site; similarly derived terms are used for peptidases characterized by other amino acids at the active site.

As used herein a “metallopeptidase” refers to any protease enzyme whose catalytic mechanism involves a metal.

As used herein the term “scissile bond” refers to a peptide bond that is hydrolysed by a peptidase.

As used herein, the term “specificity subsite” refers to the specificity of a peptidase for cleavage of a peptide bond with particular amino acids in nearby positions and is described in terminology based on that originally created by Schechter & Berger to describe the specificity of papain (Schechter, I. & Berger, A. On the size of the active site in proteases. I. Papain. Biochem. Biophys. Res. Commun. 27, 157-162 (1967)). Crystallographic structures of peptidases show that the active site is commonly located in a groove on the surface of the molecule between adjacent structural domains, and the substrate specificity is dictated by the properties of binding sites arranged along the groove on one or both sides of the catalytic site that is responsible for hydrolysis of the scissile bond. Accordingly, the specificity of a peptidase is described by use of a conceptual model in which each specificity subsite is able to accommodate the sidechain of a single amino acid residue. The sites are numbered from the catalytic site, S1, S2 . . . Sn towards the N-terminus of the substrate, and S1′, S2′ . . . Sn′ towards the C-terminus. The residues they accommodate are numbered P1, P2 . . . Pn, and P1′, P2′ . . . Pn′, respectively, as follows:

    Substrate: -P4-P3-P2-P1˜P1′-P2′-P3′-P4′
    Enzyme:    -S4-S3-S2-S1*S1′-S2′-S3′-S4′

In this representation the catalytic site of the enzyme is marked * and the peptide bond cleaved (the scissile bond) is indicated by the symbol ˜.

As used herein short sequences of amino acids of specific length may be referred to as “dimers” comprising 2 amino acids, “trimers” comprising 3 amino acids, “tetramers” comprising 4 amino acids, “octomers” comprising eight amino acids, and so forth.

As used herein, the term “cleavage site octomer” refers to the 8 amino acids located four on each side of the bond at which a peptidase cleaves an amino acid sequence. Cleavage site octomer is abbreviated as CSO.
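The octomer and subsite definitions above can be illustrated with a minimal sketch that enumerates every candidate cleavage site octomer in a sequence and labels its positions P4 through P4′. The function names (`cleavage_site_octomers`, `label_octomer`, `cleavage_site_dimer`) and the 0-based indexing are assumptions made for illustration and are not taken from the specification.

```python
from typing import Dict, Iterator, Tuple

# Position labels for a cleavage site octomer (CSO), N- to C-terminus.
CSO_LABELS = ("P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'")


def cleavage_site_octomers(sequence: str) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, octomer) for every 8-residue window.

    start_index is 0-based (an illustrative choice); the candidate scissile
    bond lies between residues start_index + 3 (P1) and start_index + 4 (P1').
    """
    for start in range(len(sequence) - 7):
        yield start, sequence[start:start + 8]


def label_octomer(octomer: str) -> Dict[str, str]:
    """Map each Schechter & Berger position label to its residue."""
    if len(octomer) != 8:
        raise ValueError("a cleavage site octomer has exactly 8 residues")
    return dict(zip(CSO_LABELS, octomer))


def cleavage_site_dimer(octomer: str) -> str:
    """Return the central P1~P1' pair of an octomer."""
    return octomer[3:5]


if __name__ == "__main__":
    # RYCINSASL is the 9-mer discussed in the figure descriptions.
    for start, cso in cleavage_site_octomers("RYCINSASL"):
        print(start, cso, cleavage_site_dimer(cso), label_octomer(cso))
```

A sequence of n residues yields n − 7 octomers, each of which is a potential cleavage site whether or not the central bond is actually cleaved.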

As used herein “cleavage site dimer” refers to the amino acid pair between which cleavage may take place. Thus the cleavage site dimer refers to the P1˜P1′ amino acid pair, whether or not the bond between these two amino acids is actually cleaved. A cleavage site dimer defines a potential cleavage site and occupies the central 2 amino acids of a cleavage site octomer.

As used herein, the term “cleavage site labeling” refers to experimental techniques in which the amino acid exposed by peptidase cleavage is labeled. Cleavage site labeling may be abbreviated herein as CSL. The principle is that each cleavage event by a peptidase produces a new N-terminal amino acid residue. Different chemistries can be used to label this new N-terminal residue and thus the following examples are not limiting. The process has been called “Terminal amine isotopic labeling of substrates” (TAILS). The specificity of the labeling and the addition of elements of precisely known mass enable detection by mass spectrometry. One specific chemistry combines the use of ¹²C and ¹³C formaldehyde reactions. By labeling the control and the peptidase-treated samples separately and then mixing them before analysis in the mass spectrometer, the newly created N-terminal sites resulting from peptidase cleavage are detected by the addition of ¹³C to the newly cleaved sites. A variation of the method is to metabolically label the substrate proteins to be tested in cell culture with stable isotopes of N, C or O (called SILAC, “stable isotope labeling with amino acids in cell culture”). The difference between the labeled samples and the control can be subsequently detected with a mass spectrometer.

As used herein, the term “support vector machine” refers to a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.

As used herein, the term “classifier” when used in relation to statistical processes refers to processes such as neural nets and support vector machines.

As used herein “neural net”, which is used interchangeably with “neural network” and sometimes abbreviated as NN, refers to various configurations of classifiers used in machine learning, including multilayered perceptrons with one or more hidden layers, support vector machines and dynamic Bayesian networks. These methods share in common the ability to be trained, to have the quality of their training evaluated, and to make either categorical classifications or predictions of continuous numbers in a regression mode. Perceptron as used herein is a classifier which maps its input x to an output value which is a function of x, or a graphical representation thereof.

As used herein “recursive partitioning” or “recursive partitioning algorithm” refers to a statistical method for multivariable analysis. Recursive partitioning operates through a decision tree that strives to correctly classify members of the population based on several dichotomous dependent variables.

As used herein, the term “motif” refers to a characteristic sequence of amino acids forming a distinctive pattern.

As used herein, the term “genome” refers to the genetic material (e.g., chromosomes) of an organism or a host cell.

As used herein, the term “proteome” refers to the entire set of proteins expressed by a genome, cell, tissue or organism. A “partial proteome” refers to a subset of the entire set of proteins expressed by a genome, cell, tissue or organism. Examples of “partial proteomes” include, but are not limited to, transmembrane proteins, secreted proteins, and proteins with a membrane motif.

As used herein, the terms “protein,” “polypeptide,” and “peptide” refer to a molecule comprising amino acids joined via peptide bonds. In general “peptide” is used to refer to a sequence of 20 or fewer amino acids and “polypeptide” is used to refer to a sequence of greater than 20 amino acids.

As used herein, the terms “synthetic polypeptide,” “synthetic peptide” and “synthetic protein” refer to peptides, polypeptides, and proteins that are produced by a recombinant process (i.e., expression of exogenous nucleic acid encoding the peptide, polypeptide or protein in an organism, host cell, or cell-free system) or by chemical synthesis.

As used herein, the term “protein of interest” refers to a protein encoded by a nucleic acid of interest. It may be applied to any protein to which further analysis is applied or the properties of which are tested or examined.

As used herein, the term “native” (or wild type) when used in reference to a protein refers to proteins encoded by the genome of a cell, tissue, or organism, other than one manipulated to produce synthetic proteins.

As used herein, the term “B-cell epitope” refers to a polypeptide sequence that is recognized and bound by a B-cell receptor. A B-cell epitope may be a linear peptide or may comprise several discontinuous sequences which together are folded to form a structural epitope. Such component sequences which together make up a B-cell epitope are referred to herein as B-cell epitope sequences. Hence, a B-cell epitope may comprise one or more B-cell epitope sequences. A linear B-cell epitope may comprise as few as 2-4 amino acids or more amino acids.

As used herein, the term “predicted B-cell epitope” refers to a polypeptide sequence that is predicted to bind to a B-cell receptor by a computer program, for example, in addition to methods described herein, Bepipred (Larsen, et al., Immunome Research 2:2, 2006) and others as referenced by Larsen et al. (ibid) (Hopp T et al., PNAS 78:3824-3828, 1981; Parker J et al., Biochem. 25:5425-5432, 1986). A predicted B-cell epitope may refer to the identification of B-cell epitope sequences forming part of a structural B-cell epitope or to a complete B-cell epitope.

As used herein, the term “T-cell epitope” refers to a polypeptide sequence bound to a major histocompatibility protein molecule in a configuration recognized by a T-cell receptor. Typically, T-cell epitopes are presented on the surface of an antigen-presenting cell. A T-cell epitope comprises amino acids which are exposed outwardly towards the T-cell (T-cell exposed motifs or TCEMs) and amino acids which are directed inwardly towards the groove of the binding MHC molecule (groove exposed motifs or GEMs).

As used herein, the term “predicted T-cell epitope” refers to a polypeptide sequence that is predicted to bind to a major histocompatibility protein molecule by the neural network algorithms described herein or as determined experimentally.

As used herein, the term “major histocompatibility complex (MHC)” refers to the MHC Class I and MHC Class II genes and the proteins encoded thereby. Molecules of the MHC bind small peptides and present them on the surface of cells for recognition by T-cell receptor-bearing T-cells. The MHC is both polygenic (there are several MHC class I and MHC class II genes) and polymorphic (there are multiple alleles of each gene). The terms MHC-I, MHC-II, MHC-1 and MHC-2 are variously used herein to indicate these classes of molecules. Included are both classical and nonclassical MHC molecules. An MHC molecule is made up of multiple chains (alpha and beta chains) which associate to form a molecule. The MHC molecule contains a cleft which forms a binding site for peptides. Peptides bound in the cleft may then be presented to T-cell receptors. The term “MHC binding region” refers to the cleft region or groove of the MHC molecule where peptide binding occurs.

As used herein the terms “canonical” and “non-canonical” are used to refer to the orientation of an amino acid sequence. Canonical refers to an amino acid sequence presented or read in the N terminal to C terminal order; non-canonical is used to describe an amino acid sequence presented in the inverted or C terminal to N terminal order.

As used herein, the term “haplotype” refers to the HLA alleles found on one chromosome and the proteins encoded thereby. Haplotype may also refer to the allele present at any one locus within the MHC.

As used herein, the term “polypeptide sequence that binds to at least one major histocompatibility complex (MHC) binding region” refers to a polypeptide sequence that is recognized and bound by one or more particular MHC binding regions as predicted by the neural network algorithms described herein or as determined experimentally.

As used herein, the term “allergen” refers to an antigenic substance capable of producing immediate hypersensitivity and includes both synthetic as well as natural immunostimulant peptides and proteins.

As used herein, the term “transmembrane protein” refers to proteins that span a biological membrane. There are two basic types of transmembrane proteins. Alpha-helical proteins are present in the inner membranes of bacterial cells or the plasma membrane of eukaryotes, and sometimes in the outer membranes. Beta-barrel proteins are found only in outer membranes of Gram-negative bacteria, cell wall of Gram-positive bacteria, and outer membranes of mitochondria and chloroplasts.

As used herein, the term “external loop portion” refers to the portion of a transmembrane protein that is positioned between two membrane-spanning portions of the transmembrane protein and projects outside of the membrane of a cell.

As used herein, the term “tail portion” refers to an N-terminal or C-terminal portion of a transmembrane protein that terminates in the inside (“internal tail portion”) or outside (“external tail portion”) of the cell membrane.

As used herein, the term “secreted protein” refers to a protein that is secreted from a cell.

As used herein, the term “membrane motif” refers to an amino acid sequence that encodes a motif not in a canonical transmembrane domain but which would be expected by its function deduced in relation to other similar proteins to be located in a cell membrane, such as those listed in the publicly available psortb database.

As used herein, the term “consensus peptidase cleavage site” refers to an amino acid sequence that is recognized by a peptidase such as trypsin or pepsin.

As used herein, the term “affinity” refers to a measure of the strength of binding between two members of a binding pair, for example, an antibody and an epitope, or an epitope and an MHC-I or MHC-II haplotype. K_d is the dissociation constant and has units of molarity. The affinity constant is the inverse of the dissociation constant. An affinity constant is sometimes used as a generic term to describe this chemical entity. It is a direct measure of the energy of binding. The natural logarithm of K is linearly related to the Gibbs free energy of binding through the equation ΔG° = −RT ln(K), where R is the gas constant and T is the temperature in kelvin. Affinity may be determined experimentally, for example by surface plasmon resonance (SPR) using commercially available Biacore SPR units (GE Healthcare), or in silico by methods such as those described herein in detail. Affinity may also be expressed as the ic50 or inhibitory concentration 50, that concentration at which 50% of the peptide is displaced. Likewise ln(ic50) refers to the natural log of the ic50.
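The relationships in this definition can be made concrete with a short numerical sketch. It assumes that K is the affinity constant expressed relative to a 1 M standard state, that the temperature defaults to 298.15 K, and that ic50 values are given in nM; the function names are hypothetical and chosen only for illustration.

```python
import math

R_GAS = 8.314462618  # gas constant in J/(mol*K)


def affinity_constant(kd_molar: float) -> float:
    """Affinity constant as the inverse of the dissociation constant K_d."""
    return 1.0 / kd_molar


def delta_g_standard(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard Gibbs free energy of binding in J/mol: dG0 = -R*T*ln(K).

    The default temperature of 298.15 K is an assumption for illustration.
    """
    return -R_GAS * temperature_k * math.log(affinity_constant(kd_molar))


def ln_ic50(ic50_nm: float) -> float:
    """Natural log of an ic50 value expressed in nM."""
    return math.log(ic50_nm)


if __name__ == "__main__":
    # A K_d of 1 micromolar corresponds to roughly -34 kJ/mol at 298.15 K.
    print(delta_g_standard(1e-6) / 1000.0)
    # An ic50 of 525 nM corresponds to ln(ic50) of about 6.26.
    print(ln_ic50(525.0))
```

Because ln(K) is linear in ΔG°, working with ln(ic50) rather than raw ic50 places affinities on a scale that is proportional to binding energy.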

As used herein, the term “immunogen” refers to a molecule which stimulates a response from the adaptive immune system, which may include responses drawn from the group comprising an antibody response, a cytotoxic T cell response, a T helper response, and a T cell memory. An immunogen may stimulate an upregulation of the immune response with a resultant inflammatory response, or may result in down regulation or immunosuppression. Thus the T-cell response may be a T regulatory response.

As used herein, the term “flanking” refers to those amino acid positions lying outside, but immediately adjacent to, a peptide sequence of interest. For example, amino acids flanking the MHC binding region lie within 1-4 amino acids of either terminal amino acid of the MHC binding peptide.

As used herein, the terms “computer memory” and “computer memory device” refer to any storage media readable by a computer processor. Examples of computer memory include, but are not limited to, RAM, ROM, computer chips, digital video discs (DVDs), compact discs (CDs), hard disk drives (HDD), and magnetic tape.

As used herein, the term “computer readable medium” refers to any device or system for storing and providing information (e.g., data and instructions) to a computer processor. Examples of computer readable media include, but are not limited to, DVDs, CDs, hard disk drives, magnetic tape and servers for streaming media over networks.

As used herein, the terms “processor” and “central processing unit” or “CPU” are used interchangeably and refer to a device that is able to read a program from a computer memory (e.g., ROM or other computer memory) and perform a set of steps according to the program.

As used herein, the term “principal component analysis” refers to a mathematical process which reduces the dimensionality of a set of data (Wold, S., Sjöström, M., and Eriksson, L., Chemometrics and Intelligent Laboratory Systems 2001. 58: 109-130; Multivariate and Megavariate Data Analysis Basic Principles and Applications (Parts I&II) by L. Eriksson, E. Johansson, N. Kettaneh-Wold, and J. Trygg, 2nd ed., 2006, Umetrics Academy). Derivation of principal components is a linear transformation that locates directions of maximum variance in the original input data, and rotates the data along these axes. For n original variables, n principal components are formed as follows: The first principal component is the linear combination of the standardized original variables that has the greatest possible variance. Each subsequent principal component is the linear combination of the standardized original variables that has the greatest possible variance and is uncorrelated with all previously defined components. Thus a feature of principal component analysis is that it determines the weighting and ranking of each principal component and thus the relative contribution each makes to the underlying variance. Further, the principal components are scale-independent in that they can be developed from different types of measurements.
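The construction described in this definition (standardize the variables, then find uncorrelated linear combinations ranked by variance) can be sketched with NumPy as an eigendecomposition of the correlation matrix. This is a minimal illustration under stated assumptions, not the software used to produce the results herein; the function name `principal_components` and the row/column layout of the input are assumptions.

```python
import numpy as np


def principal_components(data: np.ndarray):
    """PCA on the correlation matrix of `data`.

    Assumed layout: rows are observations (e.g., amino acids) and columns are
    variables (e.g., physicochemical properties). Returns the component
    scores, the loadings, and the fraction of variance explained by each
    component, ordered from largest to smallest.
    """
    # Standardize each variable to zero mean and unit variance.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Correlation matrix of the original variables.
    corr = np.corrcoef(data, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Each score column is a linear combination of standardized variables.
    scores = z @ eigenvectors
    explained = eigenvalues / eigenvalues.sum()
    return scores, eigenvectors, explained


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(20, 5))  # synthetic data, for illustration only
    _, _, explained = principal_components(demo)
    print(explained)
```

Working from the correlation matrix rather than the covariance matrix is what makes the components independent of the measurement scale of each original variable.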

As used herein, the term “vector” when used in relation to a computer algorithm or the present invention, refers to the mathematical properties of the amino acid sequence.

As used herein, the term “vector,” when used in relation to recombinant DNA technology, refers to any genetic element, such as a plasmid, phage, transposon, cosmid, chromosome, retrovirus, virion, etc., which is capable of replication when associated with the proper control elements and which can transfer gene sequences between cells. Thus, the term includes cloning and expression vehicles, as well as viral vectors.

The term “isolated” when used in relation to a nucleic acid, as in “an isolated oligonucleotide”, refers to a nucleic acid sequence that is identified and separated from at least one contaminant nucleic acid with which it is ordinarily associated in its natural source. Isolated nucleic acids are nucleic acids present in a form or setting that is different from that in which they are found in nature. In contrast, non-isolated nucleic acids are nucleic acids such as DNA and RNA that are found in the state in which they exist in nature.

The terms “in operable combination,” “in operable order,” and “operably linked” as used herein refer to the linkage of nucleic acid sequences in such a manner that a nucleic acid molecule capable of directing the transcription of a given gene and/or the synthesis of a desired protein molecule is produced. The term also refers to the linkage of amino acid sequences in such a manner that a functional protein is produced.

A “subject” is an animal such as a vertebrate, preferably a mammal such as a human, a bird, or a fish. Mammals are understood to include, but are not limited to, murines, simians, humans, bovines, cervids, equines, porcines, canines, felines, etc.

An “effective amount” is an amount sufficient to effect beneficial or desired results. An effective amount can be administered in one or more administrations.

As used herein, the term “purified” or “to purify” refers to the removal of undesired components from a sample. As used herein, the term “substantially purified” refers to molecules, either nucleic or amino acid sequences, that are removed from their natural environment, isolated or separated, and are at least 60% free, preferably 75% free, and most preferably 90% free from other components with which they are naturally associated. An “isolated polynucleotide” is therefore a substantially purified polynucleotide.

As used herein “regulatory T cells” refers to a subpopulation of T cells which are immunosuppressive and downregulate the immune system. T regulatory cells may function to maintain tolerance to self-antigens, and downregulate autoimmune disease. Regulatory T cells may be abbreviated herein as T_(reg), or Treg.

As used herein “orthogonal” refers to a parameter or a mathematical expression which is statistically independent from another such parameter or mathematical expression. Hence orthogonal and uncorrelated are used interchangeably herein.

As used herein “bagging”, which is an abbreviated term for “bootstrap aggregation”, is used to describe a process whereby small, balanced subsets of data are selected randomly from a larger training dataset and processed, followed by processing of a further randomly selected subset of data from the same dataset. This cycle is repeated multiple times with different random subsets from the same dataset. This process is used for training and validation of the classifiers, then allowing the resulting predictors to be applied to larger datasets. For instance, 5 k-fold cross validation may be performed 5 times, each time starting with a different seed for the random number generator.
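The subset selection described in this definition can be sketched as follows. The sketch assumes that “balanced” means equal numbers of cleaved and un-cleaved peptides in each subset and that peptides are drawn without replacement within a subset; both are illustrative assumptions, as is the function name `balanced_bags`.

```python
import random
from typing import List, Sequence, Tuple


def balanced_bags(cleaved: Sequence[str], uncleaved: Sequence[str],
                  n_bags: int, seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Draw n_bags balanced random subsets from a larger training dataset.

    Each bag pairs a random sample of cleaved peptides with an equally sized
    random sample of un-cleaved peptides (1:1 balance is an assumption).
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    size = min(len(cleaved), len(uncleaved))
    bags = []
    for _ in range(n_bags):
        positives = rng.sample(list(cleaved), size)
        negatives = rng.sample(list(uncleaved), size)
        bags.append((positives, negatives))
    return bags


if __name__ == "__main__":
    # Placeholder identifiers stand in for real octomer sequences.
    pos = [f"cleaved_{i}" for i in range(4)]
    neg = [f"uncleaved_{i}" for i in range(20)]
    for positives, negatives in balanced_bags(pos, neg, n_bags=3, seed=42):
        print(len(positives), len(negatives), negatives)
```

Changing the seed yields a different series of subsets from the same dataset, which corresponds to repeating the cycle with a different seed for the random number generator.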

As used herein “ensemble” is used to describe a collection of similar equations or computer processes each contributing to an overall analysis. In some instances an ensemble may be a series of predictive equations each focused on a prediction specific to a particular circumstance. Ensembles may be used as a form of analysis as a “committee” where a determination is based on the “vote” of each member of the ensemble. Thus if 8 of 10 equations predict outcome “A” and 2 of 10 predict outcome “not A” then a prediction of 0.8 probability of A is made.
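The committee vote in this definition reduces to a simple proportion, sketched below. The member predictors here are stand-in functions returning True or False; in practice each member would be a trained classifier. The name `committee_probability` is hypothetical.

```python
from typing import Callable, Sequence


def committee_probability(members: Sequence[Callable[[str], bool]],
                          octomer: str) -> float:
    """Fraction of ensemble members that vote for outcome "A" (cleavage)."""
    if not members:
        raise ValueError("the ensemble must contain at least one member")
    votes = sum(1 for member in members if member(octomer))
    return votes / len(members)


if __name__ == "__main__":
    # Stand-in members: 8 vote "A" and 2 vote "not A" for any input.
    yes = lambda octomer: True
    no = lambda octomer: False
    ensemble = [yes] * 8 + [no] * 2
    print(committee_probability(ensemble, "RYCINSAS"))
```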

As used herein “enzyme class” and the associated numbering of enzyme classes refers to the classification system developed by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse. This standardized nomenclature may be accessed at the website www.chem.qmul.ac.uk/iubmb/enzyme/.

DETAILED DESCRIPTION OF THE INVENTION

This invention relates to the identification of predicted peptidase cleavage sites in proteins or peptides. Prediction of cleavage sites for some peptidases, such as trypsin, is quite simple. For example, trypsin cleaves the bond located on the C terminus side of a lysine or arginine. Other peptidases are much more discriminating and the context of their cleavage sites is much more complex. Hence definition of their cleavage sites has been challenging, as simple observation of amino acid sequences does not provide indications of characteristic sequence motifs.
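The simple trypsin rule stated above can be written in a few lines, which illustrates why such cases need no statistical model. The sketch encodes only the rule as stated (cleavage on the C terminus side of lysine or arginine) and deliberately ignores any exceptions; the function names are hypothetical.

```python
from typing import List


def trypsin_cleavage_bonds(sequence: str) -> List[int]:
    """Return 0-based indices i such that the bond between residue i and
    residue i + 1 is predicted to be cleaved (residue i is K or R)."""
    return [i for i, residue in enumerate(sequence[:-1]) if residue in "KR"]


def digest(sequence: str) -> List[str]:
    """Split a sequence into fragments at every predicted bond."""
    fragments, start = [], 0
    for i in trypsin_cleavage_bonds(sequence):
        fragments.append(sequence[start:i + 1])
        start = i + 1
    fragments.append(sequence[start:])
    return fragments


if __name__ == "__main__":
    print(trypsin_cleavage_bonds("AKRCDK"))
    print(digest("AKRCDK"))
```

A rule of this kind looks only at the P1 residue; the more discriminating peptidases discussed below depend on the wider context of the cleavage site octomer, which is why a single-residue lookup does not suffice for them.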

The accurate prediction of peptidase cleavage sites is of great utility as it allows the output of experiments on one or a few proteins, often the result of expensive and cumbersome procedures, to be applied to many other proteins of interest. Hence the prediction of peptidase cleavage positions within proteins and peptides is an important objective. In silico prediction schemes have been developed using several types of algorithms, and the topic has been reviewed recently. See, e.g., Shen H B, Chou K C (2009) Identification of proteases and their types. Anal Biochem 385: 153-160; Chou K C, Shen H B (2008) ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information. Biochem Biophys Res Commun 376: 321-325; Lohmuller T, Wenzler D, Hagemann S, Kiess W, Peters C, et al. (2003) Toward computer-based cleavage site prediction of cysteine endopeptidases. Biol Chem 384: 899-909; Song J, Tan H, Boyd S E, Shen H, Mahmood K, et al. (2011) Bioinformatic approaches for predicting substrates of proteases. J Bioinform Comput Biol 9: 149-178. All of these predictors are “binary classifiers”, which rely on the standard alphabetic representation of amino acids and assign a probability to whether the scissile bond between a particular amino acid pair will be cleaved or not. In use, they are “trained” with a set of known sequences and then those results are extrapolated to other situations, with various scoring metrics used to assess how well the classifiers perform outside of their “training” environment. A number of classifiers have been shown to perform reasonably well for a small number of enzymes, typically those with high levels of sequence specificity within the CSO, and trained with a very limited number of cleavage targets. See, e.g., Song et al.; Yang Z R (2005) Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics 21: 1831-1837. Development of rule-based predictors can be a complex task and often the rules derived are difficult to understand. See, e.g., Rognvaldsson T, Etchells T A, You L, Garwicz D, Jarman I, et al. (2009) How to find simple and accurate rules for viral protease cleavage specificities. BMC Bioinformatics 10: 149.

Several additional factors have hampered the peptidase prediction field.For any particular peptidase only relatively small data sets areavailable, given the laborious experimental approaches needed to definethe cleavage characteristics. Hence, classifiers have used training setsonly up to several hundred cleavages for any particular peptidase. See,e.g., Song et al vide supra. The ability to generalize beyond a smalltraining set is questionable and often difficult to evaluate. There isalso likely a high degree of bias in the existing datasets because theexperimental approaches used to generate the data are purposefullyexecuted with highly non-random peptide selections. Therefore theexperimental techniques have the potential to bias the predictions basedon them.

An additional complicating factor is the need for “true negatives” inorder to develop any binary classifier. Experimentalists' understandablefocus on positive experimental results exacerbates this issue wheneverexperimental data is used in classifier development. The mass of dataproduced by CSL techniques, and the randomness of the peptide sequences,should provide the basis for more reliable predictions of proteolyticcleavage than have heretofore been possible. In addition, thisexperimental approach provides the opportunity for the peptidases tobind to a very large number of CSOs in protein mixtures and thereforealso provides a large set of true negative CSO peptides that have beenexposed to the enzyme for extended periods of time without resulting incleavage.

In some embodiments, the ability to predict peptidase cleavage also facilitates the development of drugs which are peptidase inhibitors.

Proteome information is now available for many organisms and the list of available proteomes is increasing daily. Mapping locations within proteins at which peptidases cleave has been a laborious process involving either chemical synthesis of many substrates or isolation of peptides from post-cleavage mixtures, determination of their sequences and deduction of their location within the parent protein molecule. Several new proteomic techniques have recently been developed that enable identification of newly created cleavage sites through a combination of isotopic or other labeling of the residues flanking the cleavage site and mass spectrometry. See, e.g., Kleifeld O, Doucet A, auf dem Keller U, Prudova A, Schilling O, et al. (2010) Isotopic labeling of terminal amines in complex samples identifies protein N-termini and protease cleavage products. Nat Biotechnol 28: 281-288; Doucet A, Butler G S, Rodriguez D, Prudova A, Overall C M (2008) Metadegradomics: toward in vivo quantitative degradomics of proteolytic post-translational modifications of the cancer proteome. Mol Cell Proteomics 7: 1925-1951; auf dem Keller U, Schilling O (2010) Proteomic techniques and activity-based probes for the system-wide study of proteolysis. Biochimie 92: 1705-1714; Impens F, Colaert N, Helsens K, Plasman K, Van Damme P, et al. (2010) MS-driven protease substrate degradomics. Proteomics 10: 1284-1296; Agard N J, Wells J A (2009) Methods for the proteomic identification of protease substrates. Curr Opin Chem Biol 13: 503-509. These techniques are capable of deducing thousands of new cleavage locations in mixtures containing hundreds of proteins in a single experiment. Thus, one experiment can characterize hundreds of times as many cleavage events as had been previously catalogued for a particular peptidase. These approaches fall into two broad categories: a peptide-centric two-stage process where a protein mixture is first fragmented with a first protease before the protease under study is deployed, and a second, more biological process where the protease under study is given access to a mixture of intact protein molecules. See, e.g., Kleifeld et al., Doucet et al., Schilling et al., and Impens et al., supra.

The cleavage site repertoire of a specific peptidase in a particular set of experimental conditions is itself useful, but given the combinatorial diversity of proteins, there is a need to be able to extrapolate beyond the experimental results obtained and to use these techniques as the basis to build a system to reliably predict cleavage in any protein.

The region around the active site of peptidases has been conceptualized as an enzyme binding pocket comprising a series of contiguous topographic subsites following on the work of Schechter and Berger with papain. On the cleaved protein the subsites are consecutively numbered outward from the cleavage site between P1 and P1′ as P4-P3-P2-P1|P1′-P2′-P3′-P4′ and the corresponding binding sites on the enzyme numbered S4 . . . S4′. Although recent work has indicated the importance of amino acid residues outside this region, the cleavage site octomer (CSO) is conventionally the focus of research efforts on the cleavage event. See, e.g., Ng N M, Quinsey N S, Matthews A Y, Kaiserman D, Wijeyewickrema L C, et al. (2009) The effects of exosite occupancy on the substrate specificity of thrombin. Arch Biochem Biophys 489: 48-54. Several online databases (cutdb.burnham.org/; merops.sanger.ac.uk/; clipserve.clip.ubc.ca/pics/) contain extensive catalogs of cleavage events and online prediction tools are also available. Perhaps the most comprehensive is the POPS web-based prediction system (pops.csse.monash.edu.au/pops-cgi/). See Boyd S E, Pike R N, Rudy G B, Whisstock J C, Garcia de la Banda M (2005) PoPS: a computational tool for modeling and predicting protease specificity. J Bioinform Comput Biol 3: 551-585.
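By way of illustration only, the subsite numbering described above can be sketched in Python as follows. The function name `cleavage_site_octomer` and its parameter `p1_position` are hypothetical and are not taken from the disclosure; the sketch assumes a 1-based residue position for P1 and a scissile bond immediately after it.

```python
# Labels for the eight CSO positions, N-terminal to C-terminal.
CSO_LABELS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def cleavage_site_octomer(sequence, p1_position):
    """Map each CSO label to its residue.

    p1_position is the 1-based position of the P1 residue (an assumption
    of this sketch); the scissile bond lies between P1 and P1'.
    """
    start = p1_position - 4   # 0-based index of P4
    end = p1_position + 4     # exclusive end, one past P4'
    if start < 0 or end > len(sequence):
        raise ValueError("octomer extends past the end of the sequence")
    return dict(zip(CSO_LABELS, sequence[start:end]))


site = cleavage_site_octomer("ACDEFGHIKL", 5)
print(site)
```

A scissile bond closer than four residues to either terminus has no complete octomer, so the sketch raises an error there rather than padding the sequence.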

The field of bioinformatics has provided powerful tools to analyze large datasets arising from sequenced genomes, proteomes and transcriptomes. But often analysis of the proteomic information has been based on individual amino acids, using sequences, not segments, and without translation to structure, biological function and location of the proteins in the whole organism.

For the reasons stated above there is a need for a method to identify peptidase cleavage sites. Accordingly, in some embodiments, the present invention provides computer implemented processes of identifying peptidase cleavage sites within polypeptides and proteins and the predicted peptide and polypeptide fragments that are generated by peptidase activity.

Principal component analysis is a very powerful statistical tool that is being used increasingly to reduce the dimensionality of large data sets in data mining applications and in systems biology analytics. The inventors recently showed that the principal components of amino acid (PCAA) physical properties could be used in combination with large training sets in public databases to predict the binding affinity of peptides to major histocompatibility complex I (MHC I) and MHC II molecules. See Bremel R D, Homan E J (2010) An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome Res 6: 7. 1745-7580-6-7; Bremel R D, Homan E J (2010) An integrated approach to epitope analysis II: A system for proteomic-scale prediction of immunological characteristics. Immunome Res 6: 8. 1745-7580-6-8.

The current invention provides inventive applications of the statistical work of Wold and his colleagues, who introduced the use of amino acid principal components of small peptides in a predictive way with a partial least squares (PLS) regression process. See, e.g., Sjostrom M, Eriksson L, Hellberg S, Jonsson J, Skagerberg B, et al. (1989) Peptide QSARS: PLS modelling and design in principal properties, Prog Clin Biol Res 291: 313-317; Hellberg S, Sjostrom M, Skagerberg B, Wold S (1987) Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem 30: 1126-1135; Linusson A, Elofsson M, Andersson I E, Dahlgren M K (2010) Statistical molecular design of balanced compound libraries for QSAR modeling. Curr Med Chem 17: 2001-2016; Linusson A, Gottfries J, Lindgren F, Wold S (2000) Statistical molecular design of building blocks for combinatorial chemistry. J Med Chem 43: 1320-1328.

PLS concepts have played a major role in the growth of the field of medicinal chemistry chemometrics but have not been exploited in the field of bioinformatics. The similarity between the octomer binding pocket of a peptidase and the binding pocket of an MHC molecule is obvious and suggested that perhaps a similar process might be developed for predicting peptidase cleavage probabilities instead of binding affinities. Our preliminary work in this area prior to the availability of the CSL experimental results identified the generally small datasets for training the classifiers as being a fundamental limitation. The CSL datasets now becoming available provide training sets of adequate sizes to produce reliable prediction tools.

In some embodiments, the present invention provides processes, preferably computer implemented, for the derivation of ensembles of equations for the prediction of peptidase cleavage. In some embodiments the process comprises generating mathematical expressions based on multiple uncorrelated physical parameters of amino acids, wherein said mathematical expressions serve as descriptors of a peptide. The peptide descriptor is then applied to a training set of peptides for which the cleavage site and probability has been experimentally determined. The mathematical descriptors and the experimental data are then compared and a prediction equation derived for cleavage of the scissile bond between each possible pair of amino acids located in the cleavage site dimer positions. In some embodiments the process is then repeated to derive an equation for each possible pair of amino acids in a cleavage site dimer. The assemblage of equations for every possible cleavage site dimer then constitutes an ensemble of predictive equations which can be used together to determine a probability of cleavage.

In some embodiments, the mathematical expression which is a peptide descriptor is derived by analyzing more than one uncorrelated physical parameter of an amino acid via a computer processor, and constructing a correlation matrix of said physical parameters. In some embodiments this permits the derivation of multiple mutually orthogonal or uncorrelated proxies, wherein said proxies are weighted and ranked to provide descriptors of the amino acids. In further embodiments a number of the proxies which contribute most to the description of the amino acid variability are then selected to serve as descriptors. In some embodiments this number of proxies may be three or more. In some embodiments the proxies are principal components. In further embodiments, by combining the mathematical expression comprising several proxies describing each amino acid in a peptide, a mathematical descriptor for the peptide is derived.

In some embodiments, the computer assisted process of assembling an ensemble of equations to predict peptide cleavage requires first deriving a predictive equation for each cleavage site dimer pair of amino acids. By examining a set of peptides for which the cleavage is known, cleavage site dimers are identified which are comprised of identical amino acids but which are located in peptides that are either cleaved or uncleaved. A subset of peptides with these properties is randomly selected. By comparison of the cleaved and uncleaved peptides with the aforesaid peptide descriptors, a first equation is derived to predict cleavage for that cleavage site dimer. This process is repeated on a second random subset of the peptides and then repeated multiple times, each time enhancing the precision of the predictive equation for the particular cleavage site dimer. In some embodiments, the derivation of a predictive equation is then conducted for other cleavage site dimer amino acid pairs until the maximum of 400 possible pairs has been examined and the corresponding predictive equations have been derived to form an ensemble of predictive equations.
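The grouping and repeated random selection described above might be organized along the following lines. This is a minimal sketch under stated assumptions: the names `group_by_dimer` and `balanced_subsets` are hypothetical, training peptides are assumed to be supplied as (octomer, cleaved) pairs with the dimer at P1-P1′, and drawing equal numbers of cleaved and uncleaved peptides in each round is an assumption of the sketch rather than a requirement stated in the text.

```python
import random
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 possible cleavage site dimers (P1 followed by P1').
ALL_DIMERS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def group_by_dimer(labelled_octomers):
    """Group (octomer, cleaved) pairs by their P1-P1' dimer."""
    groups = defaultdict(lambda: {True: [], False: []})
    for octomer, cleaved in labelled_octomers:
        dimer = octomer[3:5]  # residues at P1 and P1'
        groups[dimer][bool(cleaved)].append(octomer)
    return groups


def balanced_subsets(group, n_rounds, size, seed=0):
    """Yield n_rounds random (cleaved, uncleaved) subsets for one dimer.

    Equal-sized draws are an assumption of this sketch.
    """
    rng = random.Random(seed)
    k = min(size, len(group[True]), len(group[False]))
    for _ in range(n_rounds):
        yield rng.sample(group[True], k), rng.sample(group[False], k)


# Hypothetical training data, for illustration only.
data = [("AAAKGAAA", True), ("CCCKGCCC", False),
        ("DDDKGDDD", True), ("EEEKGEEE", False)]
groups = group_by_dimer(data)
for cleaved, uncleaved in balanced_subsets(groups["KG"], n_rounds=3, size=2):
    print(cleaved, uncleaved)
```

Each round would feed one fitting step for the dimer's predictive equation; the fitting itself is outside the scope of this sketch.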

In some embodiments, the predictive equations are then applied to a protein of interest, wherein the invention provides for inputting the protein of interest into a computer and applying amino acid descriptors based on multiple uncorrelated physical parameters to provide a peptide descriptor for each peptide comprised of a subset of amino acids from within the protein of interest. In some embodiments the process then further comprises applying the peptidase prediction equation ensemble to predict the cleavage site dimers in the peptides from the protein of interest and the probability of cleavage of each cleavage site dimer. In some particular embodiments the probability of cleavage may be 60% or 70% or 80% or 90% or higher.

In some embodiments, the present invention provides computer implemented processes of identifying peptides that interact with a peptidase and applying a mathematical expression to predict the interaction (e.g., cleavage) of the amino acids subset with the partner using a classifier.

In some embodiments, the classifier is a probabilistic classifier. In some embodiments, the probabilistic classifier is a probabilistic trained neural network. In yet other embodiments, the probabilistic classifier is a support vector machine. In yet further embodiments, the probabilistic classifier is a recursive partitioning algorithm. Other classifiers may be applied, thus these examples are not limiting.

In some preferred embodiments, the methods are used to predict peptidase cleavage using a neural network prediction scheme based on amino acid physical property principal components. Briefly, a protein is broken down into 8-mer peptides each offset by 1 amino acid. The peptide 8-mers are converted into vectors of principal components wherein each amino acid in an 8-mer is replaced by three z-scale descriptors, {z1(aa1), z2(aa1), z3(aa1)}, {z1(aa2), z2(aa2), z3(aa2)}, . . . {z1(aa8), z2(aa8), z3(aa8)}, that are effectively physical property proxy variables. With these descriptors, ensembles of neural network prediction equation sets are developed, using publicly available datasets of peptidase cleavage sites derived from experimentation. See, e.g., Impens et al., Bianossek et al., and Tholen et al., referenced above. In preferred embodiments, the peptide data is indexed to the N-terminal amino acid and thus each prediction corresponds to the 8-amino acid peptide downstream from the index position.
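The decomposition of a protein into overlapping 8-mers indexed to the N-terminal amino acid can be sketched as follows. The function name `octomer_windows` is hypothetical, and the 1-based index is assumed here to match the residue numbering convention used elsewhere in this specification.

```python
def octomer_windows(sequence, width=8):
    """Yield (index, peptide) pairs for every window of the given width.

    index is the 1-based position of the window's N-terminal residue;
    consecutive windows are offset by one amino acid.
    """
    for start in range(len(sequence) - width + 1):
        yield start + 1, sequence[start:start + width]


for index, peptide in octomer_windows("ACDEFGHIKL"):
    print(index, peptide)
```

A protein of length n yields n − 7 windows, so sequences shorter than eight residues produce none.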

The methodology described herein enables the description of predicted peptidase cleavage of a protein, based on the use of principal components as proxies for the salient physical parameters of the peptide. Having used the principal components to reduce the dimensionality of the descriptors to a mathematical expression which is a descriptor of the peptide and its component amino acids, it is then possible to analyze statistically the peptide within which cleavage occurs.

In some preferred embodiments, the processes described above are applied to understanding the context of cleavage by endopeptidases. In some embodiments the endopeptidases are in enzyme class 3.4.21, 3.4.22, 3.4.23, or 3.4.24 and include serine proteases, cysteine proteases, aspartic acid proteases, and metalloendopeptidases. In other preferred embodiments, the peptidase is a cathepsin. Representative cathepsins include, but are not limited to, Cathepsin A (serine protease), Cathepsin B (cysteine protease), Cathepsin C (cysteine protease), Cathepsin D (aspartyl protease), Cathepsin E (aspartyl protease), Cathepsin F (cysteine proteinase), Cathepsin G (serine protease), Cathepsin H (cysteine protease), Cathepsin K (cysteine protease), Cathepsin L1 (cysteine protease), Cathepsin L2 (or V) (cysteine protease), Cathepsin O (cysteine protease), Cathepsin S (cysteine protease), Cathepsin W (cysteine proteinase), and Cathepsin Z (or X) (cysteine protease).

In one particular embodiment, principal component analysis is applied to identify peptidase cleavage sites in target proteins. These and other embodiments are described in more detail below.

A. Limitations of Position Scoring Matrices

Position specific scoring matrices (PSSM) are widely used in bioinformatics analysis and have been used to characterize patterns of amino acids around the cleavage site within the CSO on the N-terminal and C-terminal side of the scissile bond. The MEROPS database provides both a tabular and a graphic description of the CSO, depicting the frequency of occurrence of amino acids at each position. However, a fundamental weakness of a PSSM is that it assumes position independence and cannot be used to assess combinatorial relationships and interactions among the different positions within the CSO. While providing a basis for graphically depicting the amino acids at different locations, the matrices are not in themselves useful in predictive processes. Predictive systems such as POPS use other additional biophysical data and predictions in conjunction with the PSSM in a rule-based system to derive cleavage probability predictions.
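The position independence assumption can be made concrete with a small sketch. The code below builds a simple per-position frequency matrix (a simplified stand-in for a PSSM, without background correction or pseudocounts) and scores a peptide as the product of per-position frequencies. The function names and the two training octomers are hypothetical.

```python
from collections import Counter


def position_frequencies(octomers):
    """Return one {residue: frequency} dict per CSO position."""
    n = len(octomers)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*octomers)]


def independent_score(matrix, peptide):
    """Score a peptide as the product of its per-position frequencies."""
    score = 1.0
    for frequencies, aa in zip(matrix, peptide):
        score *= frequencies.get(aa, 0.0)
    return score


# Hypothetical cleaved octomers: K pairs with G, and R pairs with S.
observed = ["AAAKGAAA", "AAARSAAA"]
matrix = position_frequencies(observed)
print(independent_score(matrix, "AAAKGAAA"))  # a combination that was observed
print(independent_score(matrix, "AAAKSAAA"))  # a combination never observed
```

Both peptides receive the same score, because the matrix records how often each residue occurs at each position but not which residues occur together.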

B. Application of Proteome Information

In some embodiments, the present invention provides processes that make it possible to analyze proteomic-scale information on a personal computer, using commercially available statistical software and database tools in combination with several unique computational procedures. The present invention improves computational efficiency by utilizing amino acid principal components as proxies for physical properties of the amino acids, rather than a traditional alphabetic substitution matrix bioinformatics approach. A particular advantage of principal component analysis is that the weighting and ranking of the principal components reflect the contribution of each to the underlying variance. Principal components thus provide uncorrelated proxies which are weighted and ranked. This has allowed new, more accurate and more efficient procedures for peptidase cleavage site definition to be realized.

A proteome (1) is a database table consisting of all of the proteins that are predicted to be coded for in an organism's genome. A large number of proteomes are publicly available from Genbank in electronic form and have been curated to describe the known or putative physiological function of the particular protein molecule in the organism. Advances in DNA sequencing technology now make it possible to sequence an entire organism's genome in a day and will greatly expand the availability of proteomic information. Having many strains of the same organism available for analysis will improve the potential for defining protease motifs universally. However, the masses of data available will also require that tools such as those described in this specification be made available to a scientist without the limitations of those resources currently available over the internet.

Proteins are uniquely identified in genetic databases. The Genbank administrators assign a unique identifier to the genome (GENOME) of each organism strain. Likewise a unique identifier called the Gene Index (GI) is assigned to each gene and cognate protein in the genome. As the GENOME and GI are designed to be unique identifiers they are used in this specification in all database tables and to track the proteins as the various operations are carried out. By convention the amino acid sequences of proteins are written from N-terminus (left) to C-terminus (right) corresponding to the translation of the genetic code. A 1-based numbering system is used where the amino acid at the N-terminus is designated number 1, counting from the signal peptide methionine. At various points in the process it is necessary to unambiguously identify the location of a certain amino acid or groups of amino acids. For this purpose, a four component addressing system has been adopted that has the four elements separated by dots (Genome.GI.N.C).
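One possible representation of the four component address is sketched below. The text does not define the N and C elements further; this sketch assumes they are the 1-based positions of the N-terminal and C-terminal residues of the addressed segment, and it assumes that the GENOME and GI identifiers themselves contain no dots. The class name `PeptideAddress` and the identifier values shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideAddress:
    genome: str  # GENOME identifier of the organism strain
    gi: str      # Gene Index of the protein
    n: int       # assumed: 1-based position of the segment's N-terminal residue
    c: int       # assumed: 1-based position of the segment's C-terminal residue

    def __str__(self):
        return f"{self.genome}.{self.gi}.{self.n}.{self.c}"

    @classmethod
    def parse(cls, text):
        """Parse a dotted Genome.GI.N.C string back into an address."""
        genome, gi, n, c = text.split(".")
        return cls(genome, gi, int(n), int(c))


# Hypothetical identifiers, for illustration only.
address = PeptideAddress("12345", "67890", 10, 17)
print(address)
print(PeptideAddress.parse(str(address)) == address)
```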

Referring to FIG. 8, in some embodiments, a Proteome of interest is obtained in “FASTA” format via FTP transfer from the Genbank website. This format is widely used and consists of a single line identifier beginning with a single “>” and contains the GENOME and GI plus the protein's curation and other relevant organismal information followed by the protein sequence itself. In addition to the FASTA formatted file a database table is created that contains all of the information.
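A minimal reader for FASTA formatted text, of the kind that could populate such a database table, is sketched below. The function name `read_fasta` is hypothetical, the example header is invented for illustration, and no attempt is made to split the identifier line into GENOME, GI and curation fields because the exact layout of that line is not given here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without its leading ">"; sequence lines that
    follow it are concatenated into a single string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>example_protein_1 hypothetical description
MKTAYIAKQR
QISFVKSHFS
>example_protein_2 hypothetical description
MSDNELLK
"""
for header, sequence in read_fasta(example.splitlines()):
    print(header, len(sequence))
```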

In some embodiments, principal components of amino acids are utilized to generate a mathematical algorithm which provides a peptide descriptor that encompasses the variability derived from the physical properties of a peptide, based on the amino acids therein. By analyzing such mathematical descriptors in relation to observed experimental data, ensembles of predictive equations can be developed which in aggregate can accurately predict a peptidase cleavage site.

C. Application of Principal Component Analysis

Principal Components Analysis is a mathematical process that is used in many different scientific fields and which reduces the dimensionality of a set of data. (Bishop, C. M., Neural Networks for Pattern Recognition. Oxford University Press, Oxford 1995. Bouland, H. and Kamp, Y., Biological Cybernetics 1988. 59: 291-294.)

In the present invention, Principal Component Analysis is used to derive a mathematical algorithm which serves as a descriptor of an amino acid, representing the variance in physical properties. By combination of algorithms derived to describe single amino acids, descriptors for peptides can be assembled in which the interdependency of the amino acids therein is accounted for. The several principal components are used to describe each amino acid function as uncorrelated or orthogonal proxies.

Derivation of principal components is a linear transformation that locates directions of maximum variance in the original input data, and rotates the data along these axes. Typically, the first several principal components contain the most information. Principal component analysis is particularly useful for large datasets with many different variables. Using principal components provides a way to picture the structure of the data as completely as possible by using as few variables as possible. For n original variables, n principal components are formed as follows: The first principal component is the linear combination of the standardized original variables that has the greatest possible variance. Each subsequent principal component is the linear combination of the standardized original variables that has the greatest possible variance and is uncorrelated with all previously defined components. Further, the principal components are scale-independent in that they can be developed from different types of measurements. For example, datasets from HPLC retention times (time units) or atomic radii (cubic angstroms) can be consolidated to produce principal components. Another characteristic is that principal components are weighted appropriately for their respective contributions to the response and one common use of principal components is to develop appropriate weightings for regression parameters in multivariate regression analysis. Outside the field of immunology, principal components analysis (PCA) is most widely used in regression analysis. Initial tests were conducted using the principal components in a multiple regression partial least squares (PLS) approach. See, e.g., Bouland H, Kamp Y (1988) Auto association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59: 291-294. Principal component analysis can be represented in a linear network. PCA can often extract a very small number of components from quite high-dimensional original data and still retain the important structure.
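The construction described above, in which variables are standardized and successive components of greatest variance are extracted, can be sketched without external libraries as follows. This is an illustrative sketch only: it extracts eigenpairs of the correlation matrix by power iteration with deflation, which is one of several ways to compute principal components and is not stated to be the method used by the inventors. All function names and the small example data set are hypothetical.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit sample standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Approximate the largest eigenvalue and its vector by power iteration."""
    p = len(matrix)
    vector = [1.0 / (i + 1) for i in range(p)]  # arbitrary asymmetric start
    for _ in range(iterations):
        image = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in image))
        if norm == 0.0:
            break
        vector = [x / norm for x in image]
    image = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
    return sum(v * w for v, w in zip(vector, image)), vector


def principal_components(rows, k):
    """Return the first k (variance, loading vector) pairs."""
    matrix = correlation_matrix(standardize(rows))
    p = len(matrix)
    components = []
    for _ in range(k):
        value, vector = leading_eigenpair(matrix)
        components.append((value, vector))
        # Deflation: remove the component just found before the next pass.
        matrix = [[matrix[i][j] - value * vector[i] * vector[j]
                   for j in range(p)] for i in range(p)]
    return components


# Hypothetical measurements: 5 items, 3 variables (the first two correlated).
data = [[1, 2, 5], [2, 4, 3], [3, 6, 8], [4, 8, 1], [5, 10, 7]]
components = principal_components(data, 2)
total = len(data[0])  # trace of a correlation matrix equals the variable count
for value, _ in components:
    print(round(value / total, 3))  # fraction of total variance explained
```

Because the variables are standardized, the eigenvalues sum to the number of variables, so each eigenvalue divided by that number is the fraction of total variance carried by its component. Power iteration can converge slowly when two eigenvalues are nearly equal; a production implementation would more likely use a library eigensolver.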

Over the past half century a wide array of studies of physicochemical properties of amino acids have been made. Others have made tabulations of principal components, for example in the paper by Wold et al. that describes the mathematical theory underlying the use of principal components in partial least squares regression analysis. Wold S, Sjostrom M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58: 109-130. The work of Wold et al. uses eight physical properties.

Accordingly, in some embodiments, physical properties of amino acids are used for subsequent analysis. In some embodiments, the compiled physical properties are available at a proteomics resource website (expasy.org/tools/protscale.html). In some embodiments, the physical properties comprise one or more physical properties derived from the 31 different studies as shown in Table 1. In some embodiments, the data for each of the 20 different amino acids from these studies are tabulated, resulting in 20×31 different datapoints, each providing a unique estimate of a physical characteristic of that amino acid. The power of principal component analysis lies in the fact that the results of all of these studies can be combined to produce a set of mathematical properties of the amino acids which have been derived by a wide array of independent methodologies. The patterns derived in this way are similar to those of Wold et al. but the absolute numbers are different. The physicochemical properties derived in the studies used for this calculation are shown in Table 1. FIG. 2 shows eigenvalues for the 19-dimensional space describing the principal components, and further shows that the first three principal component vectors account for approximately 89.2% of the total variation of all physicochemical measurements in all of the studies in the dataset. All subsequent work described herein is based on use of the first three principal components.

TABLE 1
No. | Physical property | Reference
1 | Polarity. | Zimmerman, J. M., Eliezer, N., and Simha, R., J. Theor. Biol. 1968. 21: 170-201.
2 | Polarity (p). | Grantham, R., Science 1974. 185: 862-864.
3 | Optimized matching hydrophobicity (OMH). | Sweet, R. M. and Eisenberg, D., J. Mol. Biol. 1983. 171: 479-488.
4 | Hydropathicity. | Kyte, J. and Doolittle, R. F., J. Mol. Biol. 1982. 157: 105-132.
5 | Hydrophobicity (free energy of transfer to surface in kcal/mole). | Bull, H. B. and Breese, K., Arch. Biochem. Biophys. 1974. 161: 665-670.
6 | Hydrophobicity scale based on free energy of transfer (kcal/mole). | Guy, H. R., Biophys. J. 1985. 47: 61-70.
7 | Hydrophobicity (delta G1/2 cal). | Abraham, D. J. and Leo, A. J., Proteins 1987. 2: 130-152.
8 | Hydrophobicity scale (contact energy derived from 3D data). | Miyazawa, S. and Jernigan, R. L., Macromolecules 1985. 18: 534-552.
9 | Hydrophobicity scale (pi-r). | Roseman, M. A., J. Mol. Biol. 1988. 200: 513-522.
10 | Molar fraction (%) of 2001 buried residues. | Janin, J., Nature 1979. 277: 491-492.
11 | Proportion of residues 95% buried (in 12 proteins). | Chothia, C., J. Mol. Biol. 1976. 105: 1-12.
12 | Free energy of transfer from inside to outside of a globular protein. | Janin, J., Nature 1979. 277: 491-492.
13 | Hydration potential (kcal/mole) at 25 °C. | Wolfenden, R., Andersson, L., Cullis, P. M., and Southgate, C. C., Biochemistry 1981. 20: 849-855.
14 | Membrane buried helix parameter. | Rao, M. J. K. and Argos, P., Biochim. Biophys. Acta 1986. 869: 197-214.
15 | Mean fractional area loss (f) [average area buried/standard state area]. | Rose, G. D., Geselowitz, A. R., Lesser, G. J., Lee, R. H., and Zehfus, M. H., Science 1985. 229: 834-838.
16 | Average area buried on transfer from standard state to folded protein. | Rose, G. D., Geselowitz, A. R., Lesser, G. J., Lee, R. H., and Zehfus, M. H., Science 1985. 229: 834-838.
17 | Molar fraction (%) of 3220 accessible residues. | Janin, J., Nature 1979. 277: 491-492.
18 | Hydrophilicity. | Hopp, T. P., Methods Enzymol. 1989. 178: 571-585.
19 | Normalized consensus hydrophobicity scale. | Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R., J. Mol. Biol. 1984. 179: 125-142.
20 | Average surrounding hydrophobicity. | Manavalan, P. and Ponnuswamy, P. K., Nature 1978. 275: 673-674.
21 | Hydrophobicity of physiological L-alpha amino acids. | Black, S. D. and Mould, D. R., Anal. Biochem. 1991. 193: 72-82.
22 | Hydrophobicity scale (pi-r)2. | Fauchere, J. L., Charton, M., Kier, L. B., Verloop, A., and Pliska, V., Int. J. Pept. Protein Res. 1988. 32: 269-278.
23 | Retention coefficient in HFBA. | Browne, C. A., Bennett, H. P., and Solomon, S., Anal. Biochem. 1982. 124: 201-208.
24 | Retention coefficient in HPLC, pH 2.1. | Meek, J. L., Proc. Natl. Acad. Sci. U.S.A 1980. 77: 1632-1636.
25 | Hydrophilicity scale derived from HPLC peptide retention times. | Parker, J. M., Guo, D., and Hodges, R. S., Biochemistry 1986. 25: 5425-5432.
26 | Hydrophobicity indices at pH 7.5 determined by HPLC. | Cowan, R. and Whittaker, R. G., Pept. Res. 1990. 3: 75-80.
27 | Retention coefficient in TFA. | Browne, C. A., Bennett, H. P., and Solomon, S., Anal. Biochem. 1982. 124: 201-208.
28 | Retention coefficient in HPLC, pH 7.4. | Meek, J. L., Proc. Natl. Acad. Sci. U.S.A 1980. 77: 1632-1636.
29 | Hydrophobicity indices at pH 3.4 determined by HPLC. | Cowan, R. and Whittaker, R. G., Pept. Res. 1990. 3: 75-80.
30 | Mobilities of amino acids on chromatography paper (RF). | Akintola, A. and Aboderin, A. A., Int. J. Biochem. 1971. 2: 537-544.
31 | Hydrophobic constants derived from HPLC peptide retention times. | Wilson, K. J., Honegger, A., Stotzel, R. P., and Hughes, G. J., Biochem. J. 1981. 199: 31-41.

In some embodiments, the principal component vectors derived are shown in Table 2. Each of the first three principal components is sorted to demonstrate the underlying physicochemical properties most closely associated with it. From this it can be seen that the first principal component (Prin1) is an index of amino acid polarity or hydrophobicity; the most hydrophobic amino acids have the highest numerical value. The second principal component (Prin2) is related to the size or volume of the amino acid, with the smallest having the highest score. The physicochemical properties embodied in the third component (Prin3) are not immediately obvious, except for the fact that the two amino acids containing sulfur rank among the three lowest values.

TABLE 2
Amino acid | Prin1 | Amino acid | Prin2 | Amino acid | Prin3
K | −6.68 | W | −3.50 | C | −3.84
R | −6.30 | R | −2.93 | H | −1.94
D | −6.04 | Y | −2.06 | M | −1.46
E | −5.70 | F | −1.53 | E | −1.46
N | −4.35 | K | −1.32 | R | −0.91
Q | −3.97 | H | −1.00 | V | −0.35
S | −2.65 | Q | −0.47 | D | −0.18
H | −2.55 | M | −0.43 | I | 0.04
T | −1.42 | P | −0.36 | F | 0.05
G | −0.76 | L | −0.20 | Q | 0.15
P | −0.03 | D | 0.03 | W | 0.16
A | 0.72 | N | 0.21 | N | 0.30
C | 2.11 | I | 0.29 | Y | 0.37
Y | 2.58 | E | 0.34 | T | 0.94
M | 4.14 | T | 0.80 | K | 1.16
V | 4.79 | S | 1.84 | L | 1.17
W | 5.68 | V | 1.98 | G | 1.21
L | 6.59 | A | 2.48 | S | 1.30
I | 6.65 | C | 2.74 | A | 1.42
F | 7.18 | G | 3.08 | P | 1.87

In some embodiments, the systems and processes of the present invention use from about one to about 10 or more vectors corresponding to a principal component. In some embodiments, for example, either one or three vectors are created for the amino acid sequence of the protein or peptide subsequence within the protein. The vectors represent the mathematical properties of the amino acid sequence and are created by replacing the alphabetic coding for the amino acid with the relevant mathematical properties embodied in each of the three principal components.
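The replacement of alphabetic coding by principal component values can be sketched as a simple lookup. The values below are the Prin1, Prin2 and Prin3 entries of Table 2 rearranged by amino acid; the names `PRINCIPAL_COMPONENTS` and `encode_peptide` are hypothetical.

```python
# (Prin1, Prin2, Prin3) for each amino acid, transcribed from Table 2.
PRINCIPAL_COMPONENTS = {
    "A": (0.72, 2.48, 1.42),    "C": (2.11, 2.74, -3.84),
    "D": (-6.04, 0.03, -0.18),  "E": (-5.70, 0.34, -1.46),
    "F": (7.18, -1.53, 0.05),   "G": (-0.76, 3.08, 1.21),
    "H": (-2.55, -1.00, -1.94), "I": (6.65, 0.29, 0.04),
    "K": (-6.68, -1.32, 1.16),  "L": (6.59, -0.20, 1.17),
    "M": (4.14, -0.43, -1.46),  "N": (-4.35, 0.21, 0.30),
    "P": (-0.03, -0.36, 1.87),  "Q": (-3.97, -0.47, 0.15),
    "R": (-6.30, -2.93, -0.91), "S": (-2.65, 1.84, 1.30),
    "T": (-1.42, 0.80, 0.94),   "V": (4.79, 1.98, -0.35),
    "W": (5.68, -3.50, 0.16),   "Y": (2.58, -2.06, 0.37),
}


def encode_peptide(peptide):
    """Replace each residue letter with its three principal component values."""
    vector = []
    for residue in peptide:
        vector.extend(PRINCIPAL_COMPONENTS[residue])
    return vector


descriptor = encode_peptide("ACDEFGHI")
print(len(descriptor))  # 8 residues x 3 components
```

An octomer therefore becomes a 24-element numeric vector, ordered residue by residue from the N-terminal end.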

D. Artificial Neural Network Regression.

In some embodiments, the present invention provides and utilizes probabilistic neural networks that predict peptidase cleavage sites. A neural network is a powerful data modeling tool that is able to capture and represent complex input/output relationships. The motivation for the development of neural network technology stemmed from the desire to develop an artificial system that could perform “intelligent” tasks similar to those performed by the human brain. Neural networks resemble the human brain in the following two ways: a neural network acquires knowledge through learning and a neural network's knowledge is stored within inter-neuron connection strengths known as synaptic weights (i.e. equations). Whether the principal components could be used in the context of a NN platform was tested. Some work has been reported recently using actual physical properties and neural networks in what is called a quantitative structure activity relationship (QSAR). See, e.g., Tian F, Lv F, Zhou P, Yang Q, Jalbout A F (2008) Toward prediction of binding affinities between the MHC protein and its peptide ligands using quantitative structure-affinity relationship approach. Protein Pept Lett 15: 1033-1043; Tian F, Yang L, Lv F, Yang Q, Zhou P (2009) In silico quantitative prediction of peptides binding affinity to human MHC molecule: an intuitive quantitative structure-activity relationship approach. Amino Acids 36: 535-554; Huang R B, Du Q S, Wei Y T, Pang Z W, Wei H, et al. (2009) Physics and chemistry-driven artificial neural network for predicting bioactivity of peptides and proteins and their design. J Theor Biol 256: 428-435. S0022-5193(08)00450-5 [pii]; 10.1016/j.jtbi.2008.08.028 [doi]. One of these articles used a huge array of physical properties in conjunction with complex multilayer neural networks. However, methods using physical properties directly suffer a major drawback in that there is really no way to know, or even to assess, what is the correct weighting of various physical properties. This is a major constraint as it is well known that the ability of NN to make predictions depends on the inputs being properly weighted (Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press. Patterson, D. (1996). Artificial Neural Networks. Singapore: Prentice Hall. Specht, D. F. (1991). A Generalized Regression Neural Network. IEEE Transactions on Neural Networks 2 (6), 568-576.). Besides simplifying the computations, appropriate weighting is a fundamental advantage of using the principal components of amino acids as proxies for the physical properties themselves. As FIG. 14 shows, the first three principal components accurately represent nearly 90% of all physical properties measured in 31 different studies.

Multi-layer Perceptron Design. In some embodiments, one or more principal components of amino acids within a peptide of a desired length are used as the input layer of a multilayer perceptron network. In some embodiments, the output layer is the probability of peptidase cleavage. In some embodiments, the first three principal components in Table 3 were deployed as three uncorrelated physical property proxies as the input layer of a multi-layer perceptron (MLP) neural network (NN) regression process, the output layer of which is the probability of cleavage. A diagram depicting the design of the MLP is shown in FIG. 7. The overall purpose is to produce a series of equations that allow the prediction of the probability of cleavage using the physical properties of the amino acids in the peptide n-mer under consideration as input parameters. In preferred embodiments the n-mer is an octomer representing the peptidase enzyme binding groove. Clearly more principal components could be used; however, the first three proved adequate for the purposes intended.

A number of decisions must be made in the design of the MLP. One of the major decisions is to determine what number of nodes to include in the hidden layer. For the NN to perform reliably, an optimum number of hidden nodes in the MLP must be determined. There are many “rules of thumb”, but the best method is to use an understanding of the underlying system, along with several statistical estimators, followed by empirical testing to arrive at the optimum. The molecular binding groove comprising the active site of many peptidases has been understood to accommodate 8 amino acids. See, e.g., Schechter and Berger (1967).

In some embodiments, the number of hidden nodes is set to correlate to, or be equal to, the number of binding pocket domains. Such a network is a relatively small step from PLS (linear) regression, with the inherent ability of the NN to handle non-linearity providing an advantage in the fitting process. A diagram of the MLP for peptidase cleavage prediction is shown in FIG. 7.
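The perceptron layout described above can be sketched as follows. This is only an illustration under stated assumptions: the principal-component values are random placeholders rather than the values of Table 3, the weights are untrained, and the tanh hidden activation and logistic output are choices made for this sketch, not details taken from the description.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# HYPOTHETICAL stand-in for Table 3: random numbers, NOT the real
# amino acid principal components.
_rng = np.random.default_rng(0)
PC_TABLE = {aa: _rng.normal(size=3) for aa in AMINO_ACIDS}


def encode_octomer(octomer):
    """Concatenate three component values per residue: 8 x 3 = 24 inputs."""
    if len(octomer) != 8:
        raise ValueError("expected an 8-residue peptide")
    return np.concatenate([PC_TABLE[aa] for aa in octomer])


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer (tanh, assumed) feeding one logistic output."""
    hidden = np.tanh(w_hidden @ x + b_hidden)
    logit = w_out @ hidden + b_out
    return 1.0 / (1.0 + np.exp(-logit))


# Untrained, randomly initialised weights; the 8 hidden nodes mirror the
# 8 subsites of the binding pocket.
w_hidden = _rng.normal(scale=0.1, size=(8, 24))
b_hidden = np.zeros(8)
w_out = _rng.normal(scale=0.1, size=8)
b_out = 0.0

x = encode_octomer("AGLKSVTE")
p = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
```

With fitted weights in place of the random ones, `p` would be read as the predicted probability that the octomer is cleaved.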

Training Sets and NN Quality Control. In developing NN predictive tools, a common feature is a process of cross validation of the results by use of “training sets” in the “learning” process. In practice, the prediction equations are computed using a subset of the training set and then tested against the remainder of the set to assess the reliability of the method. To establish the generalizability of the predictions, a random holdback cross validation procedure was used along with various statistical metrics to assess the performance of the NN.

A common problem with NN development is “overfitting”, or the propensity of the process to fit noise rather than just the desired data pattern in question. There are a number of statistical approaches that have been devised by which the degree of “overfitting” can be evaluated. NN development tools have various “overfitting penalties” that attempt to limit overfitting by controlling the convergence parameters of the fitting. The NN platform in JMP® provides a method of r² statistical evaluation of the NN fitting process for the regression fits. Generally, the best model is derived through a series of empirical measurements. As the output of the neural net is a categorical value, no standardization is needed.

E. Peptidases

The present invention provides predictions of cleavage sites at which peptidases cleave an amino acid sequence. The processes described herein can be applied to any class of peptidase, including, but not limited to, endopeptidases, exopeptidases, intracellular peptidases, extracellular peptidases, endosomal peptidases, and lysosomal peptidases. In some embodiments, the peptidases are in enzyme classes 3.4.11, 3.4.12, 3.4.13, 3.4.14, 3.4.15, 3.4.16, 3.4.17, 3.4.18, 3.4.19, 3.4.20, 3.4.21, 3.4.22, 3.4.23, or 3.4.24 and include serine proteases, cysteine proteases, aspartic acid proteases, and metalloendopeptidases. In some embodiments, the peptidase is a cathepsin. Representative cathepsins include, but are not limited to, Cathepsin A (serine protease), Cathepsin B (cysteine protease), Cathepsin C (cysteine protease), Cathepsin D (aspartyl protease), Cathepsin E (aspartyl protease), Cathepsin F (cysteine proteinase), Cathepsin G (serine protease), Cathepsin H (cysteine protease), Cathepsin K (cysteine protease), Cathepsin L1 (cysteine protease), Cathepsin L2 (or V) (cysteine protease), Cathepsin O (cysteine protease), Cathepsin S (cysteine protease), Cathepsin W (cysteine proteinase), and Cathepsin Z (or X) (cysteine protease).

In preferred embodiments, the processes of the present invention are utilized to analyze amino acid sequences to determine whether the sequences contain a peptidase cleavage motif and to determine the identity and location of the motif. These processes may be applied to any amino acid sequence. In some embodiments, the amino acid sequence comprises the amino acid sequences of a class of proteins selected from the group derived from the proteome of pathogenic microorganisms. In other embodiments the amino acid sequences derive from a class of proteins selected from the group comprising allergens (including but not limited to plant allergen proteins and food allergens). In other embodiments the amino acid sequences derive from a class of proteins selected from the group comprising mammalian proteins including but not limited to tumor associated antigen proteins, proteins reactive in autoimmunity, enzymes and structural mammalian proteins. In other embodiments the amino acid sequences derive from a class of proteins selected from the group comprising synthetic and recombinantly manufactured proteins, including but not limited to biopharmaceuticals (e.g., replacement enzymes, clotting factors, monoclonal antibodies and antibody fusions) and industrial proteins (for example in food additives, textiles, wood). These examples, however, should not be considered limiting, as the analytical approach can be applied to any peptidase of any species or source provided training sets can be developed experimentally by any means for that enzyme.

F. Summary of Peptidase Cleavage Prediction Methodology.

Several new proteomic techniques have recently been developed that identify newly created cleavage sites through isotopic or other labeling of the residues flanking the cleavage site, and mass spectrometry. These cleavage site labelling (CSL) techniques can characterize in a single experiment hundreds of times as many cleavage events as previously catalogued for a peptidase.

The active site of peptidases can be conceptualized as an enzyme binding pocket with 8 contiguous topographic subsites. The residues of the cleavage site octomer (CSO) are consecutively numbered outward from the cleavage site between P1 and P1′ as P4-P3-P2-P1|P1′-P2′-P3′-P4′. The corresponding binding sites on the enzyme are numbered S4 . . . S4′.
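As an illustration of this numbering, the following minimal sketch extracts the octomer around a given scissile bond and labels its residues. The function name, the example sequence, and the convention that `bond_index` is the zero-based index of the P1′ residue are assumptions made for the example; the ASCII apostrophe stands in for the prime symbol.

```python
CSO_LABELS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def cleavage_site_octomer(sequence, bond_index):
    """Label the octomer around the bond between sequence[bond_index - 1]
    (P1) and sequence[bond_index] (P1')."""
    start, end = bond_index - 4, bond_index + 4
    if start < 0 or end > len(sequence):
        raise ValueError("bond is too close to a terminus for a full octomer")
    return dict(zip(CSO_LABELS, sequence[start:end]))


# Made-up sequence; the bond lies between residue index 7 (K) and 8 (Q).
site = cleavage_site_octomer("MKTAYIAKQRQISFVK", 8)
octomer = "".join(site.values())
```

Here `site` maps each position label to one residue, so the P1 and P1′ anchor residues can be looked up directly.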

As described herein, it is contemplated that PCAA can be used to develop classifiers for prediction of peptidase cleavage sites using the much larger datasets produced by the CSL proteomic processes and provided in the Supplemental materials of three recent publications (Biniossek, et al. (2011); Impens, F. et al. (2010); Tholen, S. et al. (2011) above). The present invention utilizes a probabilistic neural network perceptron that is essentially a non-linear PLS regression (Bishop, 1995, above) provided in a widely used statistical program (JMP®) to develop the prediction equations. The present invention further utilizes a perceptron with mathematical symmetry to the biological subject, in this case the peptidase binding pocket. This process generates accurate predictions of cleavages for each of the peptidases studied.

In some preferred embodiments, non-redundant octomer datasets are derived from the CSL datasets, indexed by single amino acid displacement throughout the protein, and used for training the neural net. Non-redundant sets consisting of a single representative of each particular CSO are preferably created because the presence of multiple copies of a particular cleaved peptide could be due either to parent protein abundance in the protein mixture used or to its cleavability by a particular peptidase. Since all of the proteins in these experimental sets were exposed to the peptidase for extended periods of time and only a relatively small number of sites were cleaved, the non-cleaved peptides were assumed to be true negatives for classifier training purposes.
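The indexing and de-duplication step can be pictured with the short sketch below. It assumes the cleaved octomers are already available as a set of strings; the function name and the toy sequences are hypothetical and serve only to show the single-residue sliding window and the one-entry-per-octomer rule.

```python
def labelled_octomers(proteins, cleaved_octomers):
    """Slide an 8-residue window one amino acid at a time over each protein
    and label every distinct octomer as cleaved (1) or not cleaved (0)."""
    labels = {}
    for sequence in proteins:
        for start in range(len(sequence) - 7):
            octomer = sequence[start:start + 8]
            # A dict holds one entry per distinct octomer, which gives the
            # non-redundant set regardless of how often a peptide occurs.
            labels[octomer] = int(octomer in cleaved_octomers)
    return labels


# Toy input: the second sequence repeats part of the first one.
proteins = ["MKTAYIAKQRQISFVK", "TAYIAKQRQ"]
cleaved = {"YIAKQRQI"}
data = labelled_octomers(proteins, cleaved)
```

In this toy case the two octomers of the second sequence already occur in the first, so they add no new entries.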

The peptidase prediction output is a binary categorical variable (cleave/no-cleave) rather than a continuous real number as in MHC binding affinity prediction (i.e. the natural logarithm of the IC₅₀). The perceptron is the topological description of the underlying mathematical equation lattice comprising a neural network. For this purpose, the present invention utilizes a relatively simple perceptron having a single input layer which comprises the amino acid principal component vectors, a single layer of hidden nodes and a single binary output that is the predicted probability of cleavage coded as a zero or one. The number of hidden nodes is preferably fixed at eight, providing symmetry between the mathematics of the perceptron and the conceptual underlying enzyme binding pocket.

The PCAA are preferably derived by eigen decomposition of the correlation matrix of 31 different studies as described above. It is contemplated that use of the correlation matrix as a foundation makes it possible to combine the results of a wide variety of studies with different scoring metrics to create a composite set of vectors that are mutually orthogonal (i.e. uncorrelated), zero-centered, and appropriately weighted for their relative contributions. The peptidase cleavage datasets provided by the CSL studies have 500 to 3000 cleaved peptides that exceeded the detection limit of mass spectrometry for each experimental condition, in a total of up to approximately 800 identified proteins. For cathepsin L, training sets resulting from three different experimental conditions were used, and for cathepsin S, training sets resulting from two different experimental conditions were used. The experiments that produced the training sets were done at pH 6 (cathepsin L and cathepsin S) and pH 7.5 (cathepsin S) to represent the range of pH conditions found in the endosome compartments at different stages of antigen processing.
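The eigen decomposition step can be sketched with NumPy as follows. The input table here is random placeholder data with the shape suggested by the text (20 amino acids by 31 studies), not the published property scales, and the function name is hypothetical.

```python
import numpy as np


def principal_components(table):
    """table: rows = amino acids, columns = property scales from studies."""
    # Standardise each column so studies with different scoring metrics
    # become comparable.
    z = (table - table.mean(axis=0)) / table.std(axis=0, ddof=1)
    corr = (z.T @ z) / (table.shape[0] - 1)   # correlation between studies
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh: symmetric input
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs                      # component scores per amino acid
    explained = eigvals / eigvals.sum()       # share of total variance
    return scores, explained


rng = np.random.default_rng(1)
placeholder_table = rng.normal(size=(20, 31))  # NOT real property data
scores, explained = principal_components(placeholder_table)
first_three = scores[:, :3]
```

The leading columns of `scores` are zero-centered and mutually uncorrelated by construction, which is the property the text relies on when using the first three components as predictors.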

In preferred embodiments, the present invention encompasses creating small, balanced subsets for training (bootstrap aggregation, “bagging”) and validation of the classifiers and then using the resulting predictors for larger datasets. Preferably, a 5-fold cross validation is performed 5 times, each time starting with a different seed for the random number generator.
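One way to picture this resampling scheme is the sketch below. It is a simplification under stated assumptions: the balanced subset is drawn by undersampling negatives without replacement (a stand-in for whatever resampling the platform performs), the fold count and repeat count are both taken as five, and all names are hypothetical.

```python
import numpy as np


def balanced_subset(labels, rng):
    """Indices of all positives plus an equal-sized random draw of negatives.
    Assumes there are at least as many negatives as positives."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    drawn = rng.choice(neg, size=len(pos), replace=False)
    return np.concatenate([pos, drawn])


def repeated_kfold(indices, k=5, repeats=5, base_seed=0):
    """Yield (train, validation) index arrays: k folds per repeat, each
    repeat shuffled with a different random seed."""
    for r in range(repeats):
        rng = np.random.default_rng(base_seed + r)
        folds = np.array_split(rng.permutation(indices), k)
        for i in range(k):
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train, folds[i]


labels = [1] * 10 + [0] * 90          # toy data: 10 cleaved, 90 uncleaved
subset = balanced_subset(labels, np.random.default_rng(42))
splits = list(repeated_kfold(subset))
```

Five folds repeated five times yield 25 train/validation splits, so fitting one model per split would give 25 fitted equations.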

In some preferred embodiments, ensembles of 25 discriminant equations are produced for each amino acid found at the P1 and P1′ positions of the CSO. As nearly every amino acid (of the 20 possible) is found at each of those positions, up to nearly 1000 total discriminant equations are produced for each experimental set. Each of the members of the ensemble preferably comprises a randomly selected, independent predictor of the probability of the cleavage of a peptide at a certain P1-P1′ pair based on the combinatorial amino acids in the flanking positions. These equations perform very well, with a true positive rate of approximately 90% and a false positive rate of about 10%. In preferred embodiments, probabilities from the multiple different training sets are combined and the maximum probability of the group of predictions is used as the metric (cathepsin L=six total predictions, P1 and P1′ for 3 different experimental conditions; cathepsin S=four total predictions, P1 and P1′ for 2 experimental conditions).
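The counting and the maximum-probability rule can be made concrete with a small sketch. The probabilities below are invented for illustration, and averaging the members of one ensemble is an assumption of this example; the text states only that the maximum over the group of predictions is used.

```python
from statistics import mean

ENSEMBLE_SIZE, AMINO_ACID_COUNT, ANCHOR_POSITIONS = 25, 20, 2
# Upper bound on equations per experimental set: 25 x 20 x 2.
max_equations = ENSEMBLE_SIZE * AMINO_ACID_COUNT * ANCHOR_POSITIONS


def ensemble_probability(member_probabilities):
    """ASSUMPTION: members of one ensemble are combined by their mean."""
    return mean(member_probabilities)


def combined_probability(predictions):
    """Maximum over all (condition, anchor) predictions for one octomer."""
    return max(predictions.values())


# Invented values for one octomer scored against cathepsin L:
# 3 experimental conditions x 2 anchor predictors = 6 predictions.
cathepsin_l = {
    ("condition_1", "P1"): 0.12,
    ("condition_1", "P1'"): 0.40,
    ("condition_2", "P1"): 0.81,
    ("condition_2", "P1'"): 0.33,
    ("condition_3", "P1"): 0.05,
    ("condition_3", "P1'"): 0.27,
}
metric = combined_probability(cathepsin_l)
```

For cathepsin S the same function would receive four entries, one per anchor for each of the two experimental conditions.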

G. Applications

The processes of the present invention are useful for a wide variety of applications.

In some preferred embodiments, the processes described above provide the location and identity of peptidase cleavage motifs in a target protein or polypeptide (referred to hereafter as the target polypeptide). The present invention provides further processes wherein the target polypeptides are modified based on this information. In some embodiments, the target polypeptide is modified to mutate the peptidase cleavage motif so that the resulting mutated target polypeptide is not cleaved by the peptidase at or in proximity to the mutated site. In other embodiments, the target polypeptide is modified to include a peptidase cleavage motif for a particular peptidase(s) at a defined site within the target polypeptide. In some embodiments, these modifications are utilized to alter the degradation of protein, and in some preferred embodiments, the degradation of a protein that is utilized as a drug. In some embodiments, the target polypeptide comprises an active polypeptide (e.g., the polypeptide is a prodrug) and is modified to include a peptidase cleavage motif to facilitate release of the active polypeptide. Thus, the present invention encompasses design of polypeptides that are cleaved by particular peptidases and at sites within or outside the body where specific peptidases are expressed.

In some embodiments determination of the cleavage site is used in selecting a polypeptide for inclusion in an immunogen or vaccine, such that the immunogen will be cleaved predictably for binding by an MHC molecule and presentation at the cell surface to a T cell receptor and hence stimulate immunity, or such that a B cell epitope is bound to a B cell receptor and internalized by the B cell. In some embodiments said immunogenic peptide is selected to provide a cleavage site 4, or 5 or 6 or up to 20 amino acids from the N terminal or the C terminal of the selected peptide. In some embodiments the peptide is selected such that the predicted cleavage site is separated from the immunogenic peptide by flanking regions of 1, or 2 or 3 or up to 20 amino acids.

In other embodiments, the polypeptide to be included in an immunogen or vaccine is modified by standard cloning and genetic engineering techniques to include a peptidase cleavage motif for a particular peptidase(s) at a defined site within the polypeptide. In some preferred embodiments, the defined site flanks an amino acid sequence that is a B cell epitope or is otherwise recognized by an MHC molecule for presentation at the cell surface to a T cell receptor and stimulation of immunity.

In some embodiments, the processes described herein are utilized to design peptidase inhibitors. As described above, the present invention utilizes principal components analysis and neural networks to identify amino acid sequences that interact with the binding pocket of a peptidase. In some embodiments, the processes are utilized to identify molecular entities (such as small molecule drugs, peptides or polypeptides) that interfere with or modulate the interaction between the binding pocket of a peptidase and its preferred peptidase cleavage motif, which is located in a target polypeptide acted on by the peptidase. For example, the present invention may be utilized to design peptides which block the peptidase binding pocket or which interfere with binding to the peptidase binding pocket.

In other embodiments, the processes described herein are used to determine the potential side effects of protease inhibitors by predicting the downstream impact of protease inhibition on proteins other than the target protein.

In preferred embodiments, PCAA is used in a numerical method to predict peptidase cleavage patterns. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome. Res. 6, 7 (2010); Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis II: A system for proteomic-scale prediction of immunological characteristics. Immunome. Res. 6, 8 (2010). The process described herein makes it possible to utilize the large datasets produced by CSL in a straightforward way as training sets for developing peptidase cleavage predictors. As a least squares fitting process using numerical input vectors, it should be relatively immune to the types of peptide-similarity bias that have emerged as a problem in the use of alphabetic representations of peptides and the use of categorical predictors. Yang, Z. R. Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics. 21, 1831-1837 (2005). The z-scale vectors used as predictors were derived by eigen decomposition of the correlation matrix between a large number of studies on the biophysical properties of amino acids. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome. Res. 6, 7 (2010). This process is fundamentally a dimensional reduction that produces a set of numbers that are proxy variables. The first two z-scale vectors have obvious relationships to familiar biophysical properties of amino acids. These dimensionless numerical predictors have several additional characteristics; most importantly, they are uncorrelated and thus unique predictors, and they are appropriately scaled within and among each other. 
Use of correlated and inappropriately scaled predictors can lead to biases in predictions due to scale effects in the underlying algorithms, which is an issue to consider in the many amino acid based prediction schemes that have been described. Bishop, C. M. Neural Networks for Pattern Recognition (Oxford University Press, Oxford, 1995); Beck, H. et al. Cathepsin S and an asparagine-specific endoprotease dominate the proteolytic processing of human myelin basic protein in vitro. Eur. J. Immunol. 31, 3726-3736 (2001). Du, Q. S., Wei, Y. T., Pang, Z. W., Chou, K. C., & Huang, R. B. Predicting the affinity of epitope-peptides with class I MHC molecule HLA-A*0201: an application of amino acid-based peptide prediction. Protein Eng Des Sel 20, 417-423 (2007). The methods of the present invention also avoid the issues and complexities that arise in development of rule based systems. Rognvaldsson, T. et al. How to find simple and accurate rules for viral protease cleavage specificities. BMC. Bioinformatics. 10, 149 (2009). The methods of the present invention have been illustrated for several cathepsins; however, the concepts should be broadly applicable to peptidases in general. Overall, the performance of the predictors, in terms of specificity and sensitivity, is similar to that of signal peptidase predictors, which are routinely used for identification of signal peptides in genome curation, and of pattern classifiers in general (see FIG. 12a-c). Choo, K. H., Tan, T. W., & Ranganathan, S. A comprehensive assessment of N-terminal signal peptides prediction methods. BMC. Bioinformatics. 10 Suppl 15, S2 (2009). A rule-based approach such as PoPS uses PSSM in conjunction with various biophysical properties to produce cleavage probability predictions. Boyd, S. E., Pike, R. N., Rudy, G. B., Whisstock, J. C., & Garcia, d.l.B. PoPS: a computational tool for modeling and predicting protease specificity. J. Bioinform. Comput. Biol. 3, 551-585 (2005). 
A wide variety of systems are available for prediction of many different protein physical properties and, in general, an indexing window is used to produce numerical metrics attributable to the particular biophysical property. To the extent that the cleavage probability is related to physical properties of amino acids in the CSO, the multi-dimensional PCAA using z-scale vectors should also capture those biophysical properties and, thus, this approach also embodies elements of some of the other secondary predictors used in rule-based approaches. The CSL input datasets used for this analysis each provided data only for two reaction time points. However, with sufficient time points in an experimental protocol, the approach could be extended to produce a kinetic estimate and thus predict the rate of cleavage of different CSO.

Several different proteomic and combinatorial library cleavage analytic techniques have been developed recently. The methods derive their power from being able to cross-reference mass spectrometer fragmentation patterns with translated genomic data to identify peptidase target sequences from peptide fragmentation patterns. Approaches like “proteome identification of cleavage specificity” (PICS), “isobaric tags for relative and absolute quantification terminal amine isotopic labeling” (iTRAQ-TAILS), or “stable isotope labeling of amino acids in culture” (SILAC) can be used to characterize very large numbers of cleavage events. Biniossek, M. L., Nagler, D. K., Becker-Pauly, C., & Schilling, O. Proteomic identification of protease cleavage sites characterizes prime and non-prime specificity of cysteine cathepsins B, L, and S. J. Proteome. Res. 10, 5363-5373 (2011); Impens, F. et al. A quantitative proteomics design for systematic identification of protease cleavage events. Mol. Cell Proteomics. 9, 2327-2333 (2010); Tholen, S. et al. Contribution of cathepsin L to secretome composition and cleavage pattern of mouse embryonic fibroblasts. Biol. Chem. 392, 961-971 (2011). The combinatorial complexity of the amino acid frequencies in the CSO of the cathepsin cleavage products is enormous; virtually every amino acid is found in every position of the CSO and, for any particular peptidase, the amino acid frequency in the CSO diverges significantly from the overall frequency in proteins only at a few positions. The ever-increasing size of the databases generated by CSL techniques has illuminated the inadequacies of the general catalogs of cleavage sites previously developed from very limited experimentation. Earlier attempts to develop prediction algorithms with as few as 13 cleavage products are now overshadowed by experimental procedures that generate thousands of cleavage products. Yang, Z. R. Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics. 21, 1831-1837 (2005). While heat diagrams or alphabetic logos are useful, and do provide an impression of amino acid preferences in the CSO, they are of limited utility in development of prediction tools. Colaert, N., Helsens, K., Martens, L., Vandekerckhove, J., & Gevaert, K. Improved visualization of protein consensus sequences by iceLogo. Nat. Methods 6, 786-787 (2009); Rigaut, K. D., Birk, D. E., & Lenard, J. Intracellular distribution of input vesicular stomatitis virus proteins after uncoating. J. Virol. 65, 2622-2628 (1991). Database approaches are inherently limited to detecting direct matches, but can be expanded somewhat with rule-sets and sufficiently large databases. Schilling, O. & Overall, C. M. Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nat. Biotechnol. 26, 685-694 (2008); Song, J. et al. Bioinformatic approaches for predicting substrates of proteases. J. Bioinform. Comput. Biol. 9, 149-178 (2011); Schilling, O., auf dem, K. U., & Overall, C. M. Factor Xa subsite mapping by proteome-derived peptide libraries improved using WebPICS, a resource for proteomic identification of cleavage sites. Biol. Chem. 392, 1031-1037 (2011). Rule based systems have been successfully employed with peptidases with relatively specific CSO sequence requirements. Rognvaldsson, T. et al. How to find simple and accurate rules for viral protease cleavage specificities. BMC. Bioinformatics. 10, 149 (2009). Rule-extraction for combinatorial patterns like those seen with the cathepsins used here, where all the frequencies are nearly equivalent, may be intractable.

Endosomal peptidases only generate a few cleavages (several thousand) from among a very large potential number of octomers (several hundred thousand) in a protein mixture. This low frequency, when combined with the combinatorial complexity, is a challenge in developing prediction tools, an issue that has been recently reviewed by Chou. Chou, K. C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236-247 (2011). However, substantial research has gone into development of algorithmic approaches to deal with low frequency events in other fields. Chawla, N., Lazarevic, A., Hall, L., & Bowyer, K. SMOTEBoost: Improving prediction of the minority class in boosting. Knowledge Discovery in Databases: PKDD 2003, 107-119 (2003); Chawla, N., Eschrich, S., & Hall, L. O. Creating ensembles of classifiers. Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference, 580-581. 2001; Chawla, N. V. Data mining for imbalanced datasets: An overview. Data Mining and Knowledge Discovery Handbook 875-886 (2010). Generating multiple ensembles of prediction equations through the local optimization training process is preferred for producing predictions with useful sensitivity and specificity. Cieslak, D. A. & Chawla, N. V. Start globally, optimize locally, predict globally: Improving performance on imbalanced data. Data Mining, 2008. ICDM'08. Eighth IEEE International Conference, 143-152. 2008; Lichtenwalter, R. & Chawla, N. Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. New Frontiers in Applied Data Mining 53-75 (2010); Tang, Y., Zhang, Y. Q., Chawla, N. V., & Krasser, S. SVMs modeling for highly imbalanced classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 39, 281-288 (2009). Furthermore, in preferred embodiments, reducing the dimensionality of the datasets by the use of anchor residues in the CSO is a helpful simplification. 
Although it adds data processing complexity, a process in which prediction equation ensembles are derived individually for each of the different amino acids found at both the P1 and the P1′ anchor positions provides for an independence of the predictor output probabilities and thus an increased confidence level in the results.

An additional complication, related to the low frequency of actual cleavages relative to the potential cleavages, is whether the uncleaved CSO are valid “true negatives” that are essential for development of any binary classifier prediction tool. All the non-cleaved octomers are preferably chosen as true negatives, but this is a source of potential bias, particularly for several of the datasets derived from degradomic approaches. Impens, F. et al. A quantitative proteomics design for systematic identification of protease cleavage events. Mol. Cell Proteomics. 9, 2327-2333 (2010); Tholen, S. et al. Contribution of cathepsin L to secretome composition and cleavage pattern of mouse embryonic fibroblasts. Biol. Chem. 392, 961-971 (2011). Peptide-centric approaches such as used for the human cathepsin B, L, and S data sets are actually designed to delineate the active site specificity. Biniossek, M. L., Nagler, D. K., Becker-Pauly, C., & Schilling, O. Proteomic identification of protease cleavage sites characterizes prime and non-prime specificity of cysteine cathepsins B, L, and S. J. Proteome. Res. 10, 5363-5373 (2011). Degradomic procedures, such as those used to produce the cathepsin D, E, and murine L datasets, are thought to be less suitable to subsite mapping studies. Impens et al., Tholen et al., Schilling, O. & Overall, C. M. Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nat. Biotechnol. 26, 685-694 (2008). Cathepsins have been studied for many years and the many studies were reviewed recently by Turk et al. Turk, V. et al. Cysteine cathepsins: from structure, function and regulation to new frontiers. Biochim. Biophys. Acta 1824, 68-88 (2012). Structural studies of the enzymes suggest that for cleavage to occur the CSO peptide must assume an extended configuration within the active site cleft. Turk et al. Pretreatment with one peptidase may release shorter peptides, making CSOs more accessible to a second peptidase.

Thus, one can envision that a single cathepsin enzyme working in isolation on intact protein molecules might have difficulty in accessing certain P1-P1′ pairs, even though those pairs might be cleavable if presented to the enzyme as an extended free peptide. It is therefore likely that an isolated cathepsin will give an underestimate of the number of actual cleavages of which that enzyme might be capable in an optimal milieu. For example, cleavage of HIV GP120, a protein of 525 amino acids, by three different cathepsins individually produces a relatively small set of approximately 10 total peptides, some of them redundantly produced by the different enzymes. Yu, B., Fonseca, D. P., O'Rourke, S. M., & Berman, P. W. Protease cleavage sites in HIV-1 gp120 recognized by antigen processing enzymes are conserved and located at receptor binding sites. J. Virol. 84, 1513-1526 (2010). By comparison, Beck et al., using several cathepsins in combination, obtained over 60 peptides from the much smaller C-terminal half of myelin basic protein, which comprises 170 amino acids. Beck, H. et al. Cathepsin S and an asparagine-specific endoprotease dominate the proteolytic processing of human myelin basic protein in vitro. Eur. J. Immunol. 31, 3726-3736 (2001). In the CSL process used by Biniossek et al., individual enzymes of interest are given access over an extended period to a peptide library produced by prior cleavage of a protein mixture by a different peptidase. Under these circumstances it seems reasonable to assume that all the potential scissile bonds were available to the peptidase under evaluation, and thus all the non-cleaved CSOs are true negatives. For the datasets produced in more biological models, where the test peptidase is operating on a protein mixture such as a secretome, the situation is different and the peptidases might not have had access to all potential scissile bonds. Tholen et al., Impens et al. Here the use of uncleaved CSOs as true negatives might be problematic. 
This could possibly be verified by direct experimentation using the PICS approach on the same protein mixtures, but the lower specificities and sensitivities (approx. 0.8) seen with the degradomic data sets are consistent with there being a number of true positives still present among the uncleaved CSOs, which were inaccessible to the enzyme and thus inappropriately bias the predictors.

An additional consideration in choosing the true negatives from an experimental approach like that of Biniossek et al. is whether one should also create the true negative CSO training set in silico by a two-step approach: first cleave with a first enzyme (e.g. trypsin) in silico and then derive negative CSOs from the resulting shorter peptides rather than from the intact proteins. In preferred embodiments, this is not done because it is not clear that the cleavage by trypsin is sufficiently exhaustive to cleave at all possible sites. Further, use of the independent P1 and P1′ predictors means that cleavage predictions can be generated for peptides with a K or R in the P1 position based on the P1′ predictor, while these cleavage sites would be destroyed by trypsin. Information from other studies indicates that each of the cathepsins used also cleaves peptides with a K or R in the P1 position. Beck, H. et al. Cathepsin S and an asparagine-specific endoprotease dominate the proteolytic processing of human myelin basic protein in vitro. Eur. J. Immunol. 31, 3726-3736 (2001); Rawlings, N. D., Barrett, A. J., & Bateman, A. MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 40, D343-D350 (2012).

Development of pattern classification tools is an empirical process. Duda, R. O., Hart, P. E., & Stork, D. G. Pattern Classification (John Wiley & Sons, Inc., 2001). The results presented herein were obtained with one specific classification tool, a probabilistic neural net built in the context of the JMP® platform. Nominally similar software platforms can produce potentially widely disparate results. Chou, K. C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236-247 (2011). Because of the nature of the underlying algorithms and the stochastic way that they are implemented, it must be recognized that every software package will likely produce different results with a different degree of accuracy. There are many ways to implement a neural network algorithm and the results may differ substantially in detail. DTREG (dtreg.com) is a commercial source of a number of different prediction tools, and the comparisons that this software supplier provides for different packages with benchmark datasets (dtreg.com/benchmarks.htm) are instructive as to the variation in the predictions that can be expected from different packages (see FIG. 12a-c). During the development of the present invention, several packages were downloaded from www.dtreg.com and utilized for testing, focusing on classifiers that had cross-validation and control features comparable to the JMP® platform. The area under the receiver operating characteristic curve (AROC) was utilized as a comparison metric, and it was found that the probabilistic neural net predictor using a radial basis activation function and a support vector machine (SVM) that uses the same activation function as the JMP® neural network produced results comparable to the JMP® probabilistic neural net. A SVM is generally considered to be less susceptible to overfitting. Interestingly, the SVM seemed to provide somewhat better performance with those cathepsin D and E amino acid anchors that were problematic for the JMP® platform (FIG. 12). 
The SVM found in the e1071 package in the R statistical platform(http://www.r-project.org/, FIG. 13) was tested with an equivalenttraining approach and its performance was generally comparable to theJMP neural net. The conceptual symmetry between the neural netperceptron structure and the active site is useful as a “reasoningenvironment”. Boyd, S. E., Pike, R. N., Rudy, G. B., Whisstock, J. C., &Garcia, d.l.B. PoPS: a computational tool for modeling and predictingprotease specificity. J. Bioinform. Comput. Biol. 3, 551-585 (2005).Although the SVM produces accurate predictions they are produced byhundreds of support vectors in multidimensional space, which does notprovide a means to translate the output directly to experimental work.

Issues related to the false discovery (false positive) rate of in silico peptidase cleavage predictors have been discussed elsewhere. Schilling, O. & Overall, C. M. Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nat. Biotechnol. 26, 685-694 (2008). The false discovery rate shown in FIG. 4 (FP) varies depending on the amino acids at P1 and P1′ and also varies between different peptidases, but is generally about 10%. This is among the better of the error rates of a number of widely used classifiers with benchmark data sets (FIG. 12).

In summary, a mathematical process has been described for using proteomic data in combination with the statistical principles of principal component analysis, partial least squares regression and machine learning algorithms to develop predictors for peptidase cleavage. Results were shown for several cathepsins but the approach is extensible and should be broadly applicable to any type of peptidase enzyme. Tools that enable predictions beyond the protein sets that have been used in experiments will have a number of practical uses. With a process based on mathematical formulae, simulators can readily be constructed to predict cleavage of theoretical sequences and to assist or complement other experimental work.

EXAMPLES

To examine whether the predictions of peptidase cleavage sites derived from the computer-based analytical process described herein were correlated with data from experimental characterization of cleavage sites described in the scientific literature, the inventors conducted a number of analyses as described below.

Example 1 Correlation with Results Obtained by CSL Techniques

Here the inventors show that PCAA can be used as the basis for developing classifiers for prediction of proteolytic cleavage sites. The much larger datasets produced by the CSL proteomic processes in three recent publications (Impens et al., Tholen et al., Biniossek et al., referenced above), provided in the supplemental datasets of these publications, are used to demonstrate the principle of the technique for human cathepsins B, L, and S and murine cathepsins D, E, and L. These lysosomal (endosomal) peptidases are each thought to play roles in proteolytic processing for MHC display on the surface of antigen presenting cells as well as a variety of other physiological functions. A probabilistic neural network perceptron that is essentially a non-linear PLS, provided in a widely used statistical program (JMP®), is used as the basis for developing the prediction equations. A perceptron is utilized which has a mathematical symmetry to the biological subject, in this case the binding pocket of the peptidase. The inventors show that the process is capable of producing accurate predictions of proteolytic cleavages for each of the peptidases studied. The inventors further show that by weighting the physical properties at different positions within the CSO, the performance of the predictions can be improved. Predictor outputs are also compared to experimental results in the literature for myelin basic protein.

Datasets comprising non-redundant CSOs indexed by a single amino acid were derived (FIG. 8) from protein sequences identified in three recent publications, and all peptides were given a binary code (0,1) depending on whether or not they had been cleaved between the P1 and P1′ positions. See Biniossek et al., Impens et al., Tholen et al. For the analysis described below, alphabetic peptide sequences were converted into z-scale numerical vectors following the general method of using PCAA described recently to develop predictors of peptide binding to MHC molecules. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome. Res. 6, 7 (2010); Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis II: A system for proteomic-scale prediction of immunological characteristics. Immunome. Res. 6, 8 (2010). The approach is based on use of the first 3 principal component z-scale vectors, wherein z₁ is related to polarity/hydrophobicity, z₂ is related to size, and z₃ is related to the electronic character of the amino acid. Below, statistics and patterns of these z-scale vectors within the CSO are described and the inventors further demonstrate their use in a probabilistic neural net classifier for prediction of peptidase cleavage.

Data Sets

The datasets used are from supplemental data tables in three recent publications: Biniossek et al., Impens et al., Tholen et al. The proteins identified in these tables were downloaded from Genbank based on identifiers provided by the authors. Non-redundant octomer datasets, indexed by single amino acid displacement throughout the protein, were created and used for training the neural net as described below. Non-redundant sets were created because the presence of multiple copies of a particular cleaved peptide could be due either to parent protein abundance in the protein mixture used or to the cleavability by a particular peptidase. Since all of the proteins in these experimental sets were exposed to the peptidase for extended periods of time and only a relatively small number of sites were cleaved, the non-cleaved peptides were assumed to be “true negatives” for classifier training purposes.
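By way of illustration only, the decomposition of proteins into non-redundant octomers can be sketched as follows. The function name, the input layout (cleavage sites given as 1-based positions of the P1 residue) and the rule that an octomer is labeled as cleaved if any of its occurrences was cleaved are assumptions made for this sketch and are not taken from the cited datasets.

```python
def build_cso_dataset(proteins, cleavage_sites, width=8):
    """Map each distinct octomer to 1 (cleaved) or 0 (non-cleaved).

    proteins:       dict of protein id -> amino acid sequence
    cleavage_sites: dict of protein id -> set of 1-based P1 positions
                    (cleavage occurs between P1 and P1')
    """
    half = width // 2  # P4..P1 precede the scissile bond, P1'..P4' follow it
    labels = {}
    for protein_id, sequence in proteins.items():
        sites = cleavage_sites.get(protein_id, set())
        # Slide the window one residue at a time through the protein.
        for start in range(len(sequence) - width + 1):
            cso = sequence[start:start + width]
            p1_position = start + half  # 1-based index of the P1 residue
            label = 1 if p1_position in sites else 0
            # Non-redundant: keep one entry per octomer (assumed rule:
            # cleaved wins if the same octomer occurs more than once).
            labels[cso] = max(labels.get(cso, 0), label)
    return labels
```

Each window is displaced by a single amino acid, so a protein of length n contributes at most n − 7 octomers before duplicates are removed.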

Probabilistic Neural Network (NN)

To develop a peptidase prediction pattern classification system, an approach was used that is analogous to that used for predictions of peptide binding affinity for MHC I and MHC II. In the current situation, however, the prediction output was a binary categorical variable (cleave/no-cleave) rather than a continuous real number (i.e. binding affinity). The perceptron is the topological description of the underlying mathematical equation lattice comprising a neural network. Many possible configurations of perceptron can be constructed with different numbers of layers and inputs. A diagram of the perceptron used is shown in FIG. 7. This is a relatively simple perceptron, having a single input layer which comprises the amino acid principal component vectors, a single layer of hidden nodes and a single binary output that is the predicted probability of cleavage coded as a zero or one. The lines connecting the different portions of the perceptron represent “activation functions”. For the work described, the sigmoidal hyperbolic tangent activation function was used. The number of hidden nodes was fixed at eight, providing symmetry between the mathematics of the perceptron and the underlying enzyme binding pocket conceptualization. In addition, the symmetry between the mathematics and the molecular characteristics enables use of the simulation tools provided within the JMP® application to have direct and obvious relationships to potential experimental work.
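The topology described above can be illustrated with a minimal forward-pass sketch. This is not the JMP® implementation: the logistic output function, the placeholder random weights and all names are assumptions made for illustration, and no fitting procedure is shown.

```python
import math
import random

N_INPUTS = 24  # 8 CSO positions x 3 principal components per amino acid
N_HIDDEN = 8   # number of hidden nodes, fixed at eight as described above


def init_params(seed=0):
    """Return placeholder (untrained) weights; real values come from fitting."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)]
                for _ in range(N_HIDDEN)]
    b_hidden = [0.0] * N_HIDDEN
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def predict_probability(x, params):
    """Forward pass: tanh hidden layer, then an assumed logistic output."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    score = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-score))
```

A predicted probability above a chosen threshold would be coded as cleaved (1) and otherwise as non-cleaved (0).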

Principal Components as Input to the Neural Net

The principal components of amino acid physical properties were derived by eigen decomposition of the correlation matrix of 31 different studies as described previously. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome. Res. 6, 7 (2010). Use of the correlation matrix as a foundation makes it possible to combine the results of a wide variety of studies with different scoring metrics to create a composite set of vectors that are mutually orthogonal (i.e. uncorrelated), zero-centered, and appropriately weighted for their relative contributions. The latter characteristic is the reason principal components are used as weighting factors for regression analysis. The relationship of the principal components to well-known biophysical characteristics is readily seen in the z-scale vectors in Table 3. The first principal component (z₁) is a polarity/hydrophobicity metric, the second principal component (z₂) a size metric, and the third principal component (z₃) embodies electronic characteristics of each amino acid. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome. Res. 6, 7 (2010); Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis II: A system for proteomic-scale prediction of immunological characteristics. Immunome. Res. 6, 8 (2010). The principal component proxy variables are a unique type of descriptor. They are real numbers with many significant figures from the eigen decomposition process, but they have discrete quantized values. When used as descriptors for a peptide, three numbers (the first three principal components) are used for each amino acid, and these numbers assume various combinatorial sets within peptides.

TABLE 3 First three principal components of amino acid physical properties (“z-scale” vectors) derived from the correlation matrix of 31 different studies. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome. Res. 6, 7 (2010).

Amino acid   z1       Amino acid   z2       Amino acid   z3
K            −6.68    W            −3.50    C            −3.84
R            −6.30    R            −2.93    H            −1.94
D            −6.04    Y            −2.06    M            −1.46
E            −5.70    F            −1.53    E            −1.46
N            −4.35    K            −1.32    R            −0.91
Q            −3.97    H            −1.00    V            −0.35
S            −2.65    Q            −0.47    D            −0.18
H            −2.55    M            −0.43    I            0.04
T            −1.43    P            −0.36    F            0.05
G            −0.76    L            −0.20    Q            0.15
P            −0.03    D            0.03     W            0.16
A            0.72     N            0.22     N            0.30
C            2.11     I            0.29     Y            0.37
Y            2.58     E            0.34     T            0.94
M            4.14     T            0.80     K            1.16
V            4.79     S            1.84     L            1.17
W            5.68     V            1.98     G            1.21
L            6.59     A            2.48     S            1.30
I            6.65     C            2.74     A            1.42
F            7.19     G            3.08     P            1.87
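For illustration, the conversion of an alphabetic CSO into the numerical input vector can be sketched as follows, using the values of Table 3 regrouped per amino acid. The names Z_SCALE and encode_cso, and the ordering of the 24 numbers (z₁, z₂, z₃ for P4, then for P3, and so on through P4′), are assumptions made for this sketch.

```python
# (z1, z2, z3) per amino acid, transcribed from Table 3.
Z_SCALE = {
    "A": (0.72, 2.48, 1.42),   "C": (2.11, 2.74, -3.84),
    "D": (-6.04, 0.03, -0.18), "E": (-5.70, 0.34, -1.46),
    "F": (7.19, -1.53, 0.05),  "G": (-0.76, 3.08, 1.21),
    "H": (-2.55, -1.00, -1.94), "I": (6.65, 0.29, 0.04),
    "K": (-6.68, -1.32, 1.16), "L": (6.59, -0.20, 1.17),
    "M": (4.14, -0.43, -1.46), "N": (-4.35, 0.22, 0.30),
    "P": (-0.03, -0.36, 1.87), "Q": (-3.97, -0.47, 0.15),
    "R": (-6.30, -2.93, -0.91), "S": (-2.65, 1.84, 1.30),
    "T": (-1.43, 0.80, 0.94),  "V": (4.79, 1.98, -0.35),
    "W": (5.68, -3.50, 0.16),  "Y": (2.58, -2.06, 0.37),
}


def encode_cso(cso):
    """Return the 24-number z-scale vector for an 8-residue CSO."""
    if len(cso) != 8:
        raise ValueError("a cleavage site octomer must have 8 residues")
    vector = []
    for residue in cso:  # positions P4, P3, P2, P1, P1', P2', P3', P4'
        vector.extend(Z_SCALE[residue])
    return vector
```

Because each amino acid maps to one fixed triple, the values are quantized: an octomer consisting of a single repeated residue yields only three distinct numbers across all 24 inputs.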

Training Sets

The peptidase cleavage data provided by the CSL studies is imbalanced in multiple ways. When decomposed into successive octomers indexed by one amino acid, the protein datasets used have between 350,000 and 550,000 non-cleaved octomers, a vast excess over the 500 to 3000 cleaved peptides. A second layer of complexity derives from the differing frequencies at which any P1-P1′ pair is cleaved compared to another dipeptide, overlaid on the differing abundance of each dipeptide pair. The fundamental issue is how to “train” and validate the classifiers appropriately with the number of non-cleavage events vastly outnumbering the cleavage events. Low frequency events are not an uncommon problem and various strategies have been developed for dealing with this situation. Chawla et al. have done considerable work on the issue. Tholen, S. et al. Contribution of cathepsin L to secretome composition and cleavage pattern of mouse embryonic fibroblasts. Biol. Chem. 392, 961-971 (2011); Schechter, I. & Berger, A. On the size of the active site in proteases. I. Papain. Biochem. Biophys. Res. Commun. 27, 157-162 (1967); Ng, N. M. et al. The effects of exosite occupancy on the substrate specificity of thrombin. Arch. Biochem. Biophys. 489, 48-54 (2009); Boyd, S. E., Pike, R. N., Rudy, G. B., Whisstock, J. C., & Garcia de la Banda, M. PoPS: a computational tool for modeling and predicting protease specificity. J. Bioinform. Comput. Biol. 3, 551-585 (2005). An approach shown to produce robust results for diverse situations is to “optimize locally and apply globally” by creating small, balanced subsets (ensembles) for training and validation of the classifiers and then using the resulting predictors for larger datasets. The general scheme for ensemble assembly is shown in FIG. 8. A relatively exhaustive 5×5×5-fold cross validation scheme based on concepts outlined in Cieslak and Chawla was utilized. Cieslak, D. A. & Chawla, N. V. Start globally, optimize locally, predict globally: Improving performance on imbalanced data. Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, 143-152. 2008. IEEE.

As shown in FIG. 8, a total of five ensembles of cohort trainer sets were created for each of the possible anchor amino acids. The training cohorts are anchored and balanced by pairing each cleaved peptide having a given amino acid at the P1 or P1′ position with a number of peptides (N) from the non-cleaved set having the identical P1, P1′ or (dipeptide) P1-P1′. The choice of N is an empirical one, and after experimentation N=4 was chosen as an approach providing an excess of non-cleaved peptides for adequate training while limiting over-fitting in the 5-fold cross validation scheme. Thus, for each ensemble cohort with a particular P1 or P1′ amino acid, the cleaved peptides were combined with 4 times as many uncleaved peptides sharing the amino acids at the matching anchor positions. Each of the cohort sets uses the same cleaved peptides with different non-cleaved trainers. Then a 5-fold cross validation was performed 5 times, each time starting with a different seed for the random number generator. With this scheme a random 80 percent of the uncleaved plus cleaved ensemble cohort set is used repeatedly during training and tested on the remaining 20 percent to produce a discriminant prediction equation on convergence of the underlying algorithms, which operate using a standard least-squares fitting process. Then, using the same ensemble but with a different random number seed, a different cohort set (of 80%) is selected and another prediction equation is produced. Random number generators produce defined sequences of random numbers, so initiating the process with a different seed effectively defines a different path to convergence of the algorithms. This process is repeated a total of 5 times for each ensemble and therefore, by use of the 5 different ensembles, a total of 25 different discriminant equation sets (positive and negative predictions) were produced. Each equation set produces its own unique and independent cleavage probability estimate. Importantly, the non-cleaved negative trainers in each cohort were unique to each of the five cohorts, and the prediction equations produced for that cohort had not “seen” the peptides used by the other cohorts until the final stage of cross validation, where the equations were tested against the other four sets of non-cleaved peptides.
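The cohort assembly described above can be sketched as follows, under stated assumptions: the input lists are taken to have already been filtered to share the same anchor amino acid, each repetition is represented as one random 80/20 split per seed, and the function names are hypothetical. The sketch shows only how the 5 ensembles and 5 seeds yield 25 training runs; it does not reproduce the JMP® fitting.

```python
import random


def build_cohorts(cleaved, uncleaved, n_ensembles=5, ratio=4, seed=0):
    """Pair the same cleaved peptides with disjoint sets of negatives.

    Each cohort holds all cleaved peptides (label 1) plus ratio (N=4)
    times as many non-cleaved peptides (label 0), unique to that cohort.
    """
    needed = ratio * len(cleaved)
    if n_ensembles * needed > len(uncleaved):
        raise ValueError("not enough non-cleaved peptides for disjoint cohorts")
    pool = list(uncleaved)
    random.Random(seed).shuffle(pool)
    cohorts = []
    for k in range(n_ensembles):
        negatives = pool[k * needed:(k + 1) * needed]
        cohorts.append([(p, 1) for p in cleaved] + [(p, 0) for p in negatives])
    return cohorts


def seeded_splits(cohort, n_repeats=5, train_fraction=0.8, base_seed=0):
    """Yield one (train, test) split per random seed for a single cohort."""
    for repeat in range(n_repeats):
        shuffled = list(cohort)
        random.Random(base_seed + repeat).shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        yield shuffled[:cut], shuffled[cut:]
```

Iterating seeded_splits over every cohort returned by build_cohorts gives 5 × 5 = 25 train/test pairs, one per discriminant equation set.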

Principal Component Patterns within the Cleavage Site Octomer

For the peptidases under consideration, nearly every amino acid can be found at every position of the CSO. The utility of using an anchoring amino acid at subsite P1 was recognized based on examination of some of the underlying statistics of the z-scale vectors of the cleaved peptides compared to uncleaved peptides. FIGS. 1 and 2 demonstrate the patterns that emerge from analysis of the mean and variance. FIG. 1 shows the analysis of peptides within the CSO for human and murine cathepsin L. The differences between cleaved and uncleaved peptides in mean z₁ scale (polarity related) and z₂ scale (size related) principal component metrics are shown for each of the 20 different amino acids found at P1 (i.e. the anchor), with the pattern for alanine highlighted. The graphic shows the differential in these metrics between the anchor position (differential zero) and the adjacent subsite positions for the cleaved and random uncleaved CSOs. Distinct patterns are apparent for cleaved compared to uncleaved peptides. Human and murine cathepsin L generate notably different patterns. While this may be attributable to the fact that the data were derived from different experiments, BLAST results of the human and murine enzymes indicate that they have only about 72% sequence identity. Hence, although they both have the “L” designation, it is likely that the two peptidases are not orthologous. Turk, V. et al. Cysteine cathepsins: from structure, function and regulation to new frontiers. Biochim. Biophys. Acta 1824, 68-88 (2012). When P1′ is used as the anchor residue, analogous unique patterns for each amino acid are seen (not shown). The patterns suggest that with human cathepsin L, whatever the polarity of the amino acid at P1, on average a more apolar or hydrophobic residue occupies the P2 position (FIG. 1, panels a, b). The situation is quite different for the murine cathepsin L (panels c, d). The patterns also show that changes in z₂ scale (size related) occur on the prime side for both the human and murine enzymes (e, f, g, h) concomitant with changes in the z₁ scale (polarity related) of residues on the non-prime side of the scissile bond. Similar patterns occur using the third principal component metric, z₃ scale (electronic related, not shown). These simultaneously changing patterns are why regression approaches simultaneously using multiple physical properties can be effective at fitting the underlying patterns, and hence provide the basis for using those discriminant equations to predict the probability of cleavage of other peptides.

As with changes in the means of the z₁ and z₂ scales, there are also differences in the variance (standard deviation) of these physical properties between the cleaved and uncleaved peptides at each of the positions in the CSO. Several notable features of the patterns are shown in FIG. 2. At positions P4 and P3 the variation between the cleaved and non-cleaved peptides is not different from random, as indicated by the intersection of the confidence intervals of nearly all of the peptidases. For the remaining positions, however, there are distinct differences between the variation of the cleaved peptides and their random uncleaved cohorts, with each peptidase displaying a unique pattern. Taken together, these patterns indicate how the least-squares process of fitting the variation in physical property data might provide a means of characterizing the patterns for the different peptidases. A notable feature of the data is that the standard deviations at certain points are either greater or less than random. Examination of these patterns indicates that a limited number of amino acids of similar physical properties will produce a standard deviation less than the random peptide set (e.g. human cathepsin B at the P3′ position at both 4 and 16 hr reaction times, panel f). A variation greater than random (e.g. murine cathepsin D and E at P3′ seen in panel 0) is anomalous and is due to multimodal distributions of amino acid type at those positions.

Application of Weighting Factor Patterns

Examination of a large number of patterns like those shown in FIGS. 1 and 2 suggested that the amino acids nearest the cleavage site had the largest impact on whether or not cleavage occurred. This is also consistent with kinetic studies on several of the enzymes.¹⁴ Based on this, a series of four unit-integral regression weighting patterns were developed (FIG. 3). The weighting vectors have a pivot point at amino acids P2 and P2′ and provide an enhanced emphasis for the P1 and P1′ positions and decreased emphasis for the positions more distal from the cleavage site. Unit integral weighting was used to decrease the possible influence of scale effects in the least-squares predictions. Duda, R. O., Hart, P. E., & Stork, D. G. Pattern Classification (John Wiley & Sons, Inc., 2001). As a control, tests showed that using uniform weighting, where each of the 3 principal component numbers at all positions within the CSO was multiplied by 0.125, produced predictions identical to those obtained with the unweighted PCAA as input variables (not shown).
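A minimal sketch of unit-integral weighting is given below. The uniform pattern (0.125 at each of the eight positions) is the control described above; the peaked pattern is purely hypothetical, chosen only so that it sums to one, pivots at P2 and P2′ and emphasizes P1 and P1′, and it is not one of the four patterns of FIG. 3.

```python
# Weights for positions P4, P3, P2, P1, P1', P2', P3', P4'.
UNIFORM = [0.125] * 8
# Hypothetical illustration only; not taken from FIG. 3.
HYPOTHETICAL_PEAKED = [0.075, 0.100, 0.125, 0.200, 0.200, 0.125, 0.100, 0.075]


def apply_weights(encoded, weights, n_components=3):
    """Scale each position's principal components by that position's weight."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one (unit integral)")
    if len(encoded) != len(weights) * n_components:
        raise ValueError("encoded vector length does not match the weights")
    # Entry i belongs to position i // n_components within the CSO.
    return [value * weights[i // n_components]
            for i, value in enumerate(encoded)]
```

Under the uniform pattern every input is multiplied by the same constant, which is consistent with the control observation that uniform weighting did not change the predictions.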

FIG. 4 shows the detailed performance data for the predictors for human cathepsin L. Equivalent graphics for human cathepsin B and S are found in FIG. 9 and FIG. 10. These patterns are a visualization of the composite of all of the prediction results for all peptides derived from the training sets, with the 5 cohort ensemble sets each carried out with the 4 different weighting factor patterns. See Biniossek et al., Impens et al., Tholen et al. Only the data for the P1 anchor position is shown, but analogous patterns are obtained using a P1′ anchor. Overall, the predictors have an average sensitivity and specificity of approximately 0.9. The performance is considerably better when certain amino acids are at the anchor position than it is for others. Further, the impact of the weighting factor is large for certain amino acids and not for others. These differences in the predictor metrics are not due to the relative abundance of a particular amino acid at P1 or P1′, even though, because of the way the cohort sets were constructed, greater abundance of a particular amino acid results in a larger training set. Each of the peptidases has a unique pattern of sensitivity and specificity for particular amino acid anchors, and these patterns are maintained at different reaction time points.
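For reference, the performance metrics discussed herein can be computed from predicted probabilities as in the following sketch. The function name and the 0.5 decision threshold are assumptions for illustration; the false discovery rate is computed here as FP/(FP+TP), which is distinct from the false positive rate (one minus specificity).

```python
def classifier_metrics(y_true, y_prob, threshold=0.5):
    """Return sensitivity, specificity and false discovery rate."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        elif truth == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Undefined ratios are reported as NaN rather than raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),           # cleaved sites recovered
        "specificity": ratio(tn, tn + fp),           # non-cleaved sites rejected
        "false_discovery_rate": ratio(fp, fp + tp),  # wrong among predicted cleaved
    }
```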

Amino Acids Generating Variable Performance

While many amino acid anchors gave consistently good results, for a few amino acids the performance of the predictors was problematic. Most obvious in this regard were the predictions for either a leucine or phenylalanine anchor at P1 in murine cathepsin D and E. The results for cathepsin E are shown in FIG. 5. Interestingly, similarly poor performance was seen with these amino acids anchoring the P1′ position as well (not shown). This problem could be partially rectified using a dual anchor approach where P1 and P1′ were both fixed, and thus the prediction process was effectively limited to predicting how the flanking three positions on each side affect the probability of cleavage. This improved performance of the predictors for some amino acids, but a few continue to be problematic. A clue to the possible reason underlying this can be seen in frequency logos (FIG. 11). The amino acids with poor prediction metrics are frequently found in series (e.g. ~LLLL~). The numerical consequence of sequentially repeating runs of these amino acids within the CSO would be that all three z-scale vectors for those peptides would have a series of identical numbers in the least-squares analysis and thus render no pattern-predictive value. Nevertheless, even using the dual-anchored predictors, sensitivity and specificity values for murine cathepsins D, E and L were lower than for the other peptidases, ranging from 0.80-0.85 (FIG. 13). It is perhaps noteworthy that the input datasets where these characteristics occur are from degradomic techniques, as opposed to the peptide-centric approach PIC.

Example 2 Comparison with Support Vector Machine Prediction

Support vector machines are a commonly used type of classifier and in a variety of tests have generally been shown to perform very well. Because of the underlying mathematics, SVMs are thought to be less susceptible to over-fitting than neural networks. The e1071 package for R (www.r-project.org/) provides an interface to libsvm (www.csie.ntu.edu.tw/~cjlin/libsvm/), one of the most widely used and widely tested SVM object code libraries. JMP® V10 has the ability to send data to and execute statistical routines in the R package and to capture the results returned. Therefore the R SVM was compared with the JMP® probabilistic neural net using the same training sets and the same multiple ensemble cross validation approach.

The training sets used were from human cathepsin L cleavage. See Biniossek et al. Two unweighted training sets were used, one with alanine at the P1 anchor position and the other with glycine at the P1 anchor. The alanine training set contained 111 cleaved peptides and the glycine set contained 222 cleaved peptides. These amino acids were selected because they produced intermediate results in the NN (FIG. 4). The training process used a 5-fold cross-validation scheme as described above.

The SVM classifier was tuned for each cohort set by choosing optimum C and gamma factors using the SVM-tune platform in the R package. The number of hidden nodes in the JMP® NN was fixed at 8 and the default settings were used for controlling the convergence of the underlying algorithms.

FIG. 13 shows the results of the comparison. In this evaluation the NN out-performed the SVM in predicting the experimental results.

Example 3 Comparison with Experimental Data: Myelin Basic Protein

The data presented above are all derived from supplemental datasets in the papers indicated. Neural networks have an innate tendency to memorize what they have seen, so a fundamental question is how well the predictors perform with other data that they have not “seen”. Over several decades there have been many studies of cathepsin peptidase cleavage specificity. Turk, V. et al. Cysteine cathepsins: from structure, function and regulation to new frontiers. Biochim. Biophys. Acta 1824, 68-88 (2012). A comprehensive study was conducted by Beck et al. in which both lysosomal extracts and several cathepsins, individually and in combination, were applied to the C-terminal half of myelin basic protein, a relatively small protein implicated in the etiology of multiple sclerosis. Because of its design, the study allows comparison of peptidases used individually and in combinations and therefore provides a useful comparative dataset.

The sequence for the myelin basic protein isoform used by Beck et al., Genbank ID (Ser. No. 17/378,805; P02686), was retrieved from NCBI. The molecule was dissected into octomers for use in predictions. The CSOs were stratified by P1 and P1′, and the discriminant functions operating on the CSO produced a cleavage probability at the P1 or P1′ of the particular peptide. The cleavage predictions produced by the discriminant functions for the different peptidases at different times are highly concordant; for simplification of presentation the outputs of the different discriminant functions for each of the peptidases were consolidated.

FIG. 6 shows the consolidated results of using the PCAA predictors of both P1 and P1′ for the C-terminal segment of the same isoform of myelin basic protein used by Beck et al.

The overall concordance between the experimental results and the predictors is quite good, but it is complex and can best be appreciated by comparing the patterns in FIG. 6 with those in FIG. 3 of the original Beck et al. publication. Inset sequences in FIG. 6 are several cleavage sites identified by the authors as critical to the immune response to myelin basic protein. In line with other experimental results, the predictions for the cathepsins used tend to show partially redundant cleavage specificity. Turk et al. Interestingly, for a large number of the cleavages where Beck et al. attribute cleavage to bulk lysosomal activity, the predictors for cathepsin B (not used by Beck et al.), cathepsin D or E were found to indicate high probabilities of cleavage.

Example 4 Prediction of Peptidase Cleavage Sites in Plant Allergens

Proteins from peanuts are well known as allergens. Stanley et al. (Stanley, J. S. et al. Identification and mutational analysis of the immunodominant IgE binding epitopes of the major peanut allergen Ara h 2. Arch. Biochem. Biophys. 342, 244-253 (1997)) described IgE binding epitopes in the major peanut allergen Ara h 2, referring to isoform 1. Prickett et al. (Prickett, S. R. et al. Ara h 2 peptides containing dominant CD4+ T-cell epitopes: candidates for a peanut allergy therapeutic. J. Allergy Clin. Immunol. 127, 608-615 (2011)) characterized five CD4+ epitopes in this protein. An analysis of the peptidase cleavage sites in this protein was performed using the methods described herein. FIG. 16 shows the predicted cleavage sites by multiple cathepsins. A cleavage site was predicted immediately prior (N-terminal side) to each of three epitope peptides characterized by Prickett et al.

Example 5 Binding Characteristics and Cleavage of the CLIP Peptides

The “CLIP” peptide (MHC Class II invariant peptide) is produced by endosomal cleavage of the MHC gamma, also known as the “invariant chain”. The MHC-gamma allele is one of the genes in the MHC locus and has substantial structural similarity to the MHC molecules. It appears to have two purposes. First, a portion of the molecule binds in the molecular groove of the MHC II molecule and is used as a chaperone for guiding the MHC II molecule into the endosomal compartment where peptide loading takes place. Second, the CLIP peptide, when released by endosomal peptidase activity, binds with only a moderate binding affinity to many different MHC II alleles and serves as a placeholder for other peptides that will be loaded into the MHC molecule in its place, with the assistance of MHC-DM, for ultimate presentation on the surface of the antigen presenting cell.

Experimental evidence shows that the so-called “CLIP peptide” is not a single peptide but actually a group of peptides with slightly different lengths (ragged ends) produced by differential endosomal peptidase cleavage activity. The longest of these peptides has the sequence LPKPPKPVSKMRMATPLLMQALPMG (SEQ ID NO:5). The underlined sequence has been shown in experiments by others (Chicz R M et al, 1993 J Exp Med., Specificity and Promiscuity among Naturally Processed Peptides Bound to HLA-DR Alleles; Villadangos, J A et al 1997, J Exp Med, Degradation of Mouse Invariant Chain: Roles of Cathepsins S and D and the Influence of Major Histocompatibility Complex Polymorphism) to be the primary binding region. It has the characteristic of binding to many different MHC II molecules (so is sometimes called a promiscuous peptide) with what is generally considered a moderate affinity of about e^6.26 = 525 nM, equivalent to about −0.96σ (approximately −1σ), i.e. below the mean (FIG. 17). In fact, the neural net (NN) predictions suggest that several different binding registers will bind with a very similar binding affinity (IC50). An interesting feature of this molecule was determined recently, where new experiments (Schlundt, A. 2012, J Mol. Biol, Peptide Linkage to the alpha Subunit of MHC II Creates a Stably Inverted Antigen Presentation Complex) suggest that the peptide can bind not only in the standard or canonical N→C orientation, but also in the reverse C→N orientation. Interestingly, the NN predictions for binding of the reverse peptide are also very comparable to those for the canonical orientation. As can be seen in FIG. 18, for several binding registers the affinity is actually higher for the inverted orientation than for the canonical. A caveat to this observation is that the experimental procedures that are used to estimate the binding affinity to an MHC molecule are a bulk measurement done without knowledge of which orientation the peptides are assuming. It could be that the molecule assumes the N→C orientation, the C→N orientation, or a mixture of the two. FIG. 19 shows the predicted binding affinity in several different binding registers for the canonical and inverted peptide orientation for a single common human MHC II allele (DRB1*01:01). The results are typical of those observed for the other 27 human alleles, as well as those from other species such as the mouse.

It follows that peptides from proteins of other derivations (including but not limited to microbial, mammalian, insect, allergen, etc.) may also be bound to MHC molecules in canonical or non-canonical orientation and thus may be presented by MHC as T cell epitopes in either orientation.

Furthermore, with reference to CLIP, the experimental determinations (Chicz et al vide infra) of peptide presentation by MHC molecules on antigen presenting cells provided a system for independent verification of the cathepsin cleavage predictions. The peptides presented on the MHC molecules at the cell surface should have been excised by the endosomal peptidases. Therefore the NN cleavage predictions for the endosomal peptidases cathepsin B, L and S were compared to the N and C termini of the presented peptides. The cleavage predictions were found to be highly concordant with peptides that were attached to MHC II molecules and had been detected by mass spectrometry. The endosomal peptidases are quite aggressive enzymes and cleave at a wide variety of amino acid sequences. Consistent with this, several different cleavage positions are predicted in the vicinity of the CLIP peptide molecule. The median length of peptide eluted and detected by mass spectrometry is 17 amino acids, two amino acids longer than the generally recognized 15-mer binding pocket in the MHC II molecule. The results of this comparison are shown in FIG. 20. The experiments were carried out with virus transformed human B cells, and in this cell type cathepsin S is thought to be the predominant endopeptidase activity. The primary eluted peptide (MRMATPLLMQALPM; SEQ ID NO:6) can be seen in FIG. 20 to be bracketed by the cathepsin S cleavage on both the N- and C-termini.

In addition to the invariant chain, several additional peptides were also found loaded into the MHC II molecule (Chicz et al, vide infra). These peptides had “ragged” ends extending several amino acids on both the N-terminal and C-terminal sides of the binding pocket. In each case the cleavage predictions matched the peptides that were detected.

The experiments described above with reference to CLIP showed a critical relationship between cathepsin cleavage and MHC presentation. The observations with MHC II were extended and shown to also be consistent with observations with respect to MHC I presentation of the shorter 9-mer peptides. Peptides presented on cell surfaces bound to MHC I molecules arise via proteasomal cleavage of protein molecules tagged for destruction in the cytoplasm. The fragments produced in this part of the process are longer (up to about 20-mers) than can be accommodated in the binding pocket of MHC I molecules. These fragments are delivered to the MHC loading compartment by specialized molecular machinery called TAP (transporter associated with antigen processing), where the resident peptidases trim the peptides to fit into the binding groove.

Example 6 Prediction of Peptidase Cleavage Sites in an Immunogenic Brucella melitensis Protein

A publication by Durward et al (Durward, M. A., Harms, J., Magnani, D. M., Eskra, L., & Splitter, G. A. Discordant Brucella melitensis antigens yield cognate CD8+ T cells in vivo. Infect. Immun. 78, 168-176 (2010)) reported experimental evidence for a cytotoxic CD8+ epitope in Brucella melitensis methionine sulphoxide reductase B. The 9-mer peptide of interest was RYCINSASL (SEQ ID NO: 7), located at positions 116 to 124. The predicted peptidase cleavages were determined by the methods described above. FIG. 15 shows the predicted cleavage by human cathepsins L, S and B and murine cathepsins D, E and L. It can be seen that there is a higher predicted probability of cleavage by cathepsin immediately proximal and distal to the peptide characterized by Durward et al.

Binding Characteristics and Cleavage of Brucella melitensis Methionine Sulphoxide Reductase

A 9-mer peptide, RYCINSASL (RL9) (SEQ ID NO: 7), from Brucella melitensis methionine sulphoxide reductase B has been found to be presented on MHC I molecules and to produce populations of T-cells which recognize the pMHC complex (Durward, M. A. et al. Discordant Brucella melitensis antigens yield cognate CD8+ T cells in vivo. Infect. Immun. 78, 168-176 (2010)). Further to the studies published by Durward et al., immunization of mice with the RL9 peptide leads to a protective response pattern in mice. Two versions of the RL-G2aFc molecule shown in FIG. 21 were produced, one with an N-terminal peptide fusion, the other with both N-terminal and C-terminal peptide fusions. Mice were immunized to test the two carrier proteins carrying the known effective peptide (RL9). Mice immunized with the RL-G2a(CH2-CH3) or RL-G2a(CH2-CH3)-RL construct were able to reduce the number of RL9-pulsed target cells at a significantly higher rate than control immunized mice (FIG. 22), indicating that the RL-G2a(CH2-CH3) vaccine induces a cellular cytotoxic response against target splenocytes displaying RL9 peptide, consistent with the protective response pattern known to eliminate Brucella infection. The data show that the G2a(CH2-CH3) carrier protein bearing the larger CEG peptide is correctly cleaved and that RL9 peptide specific effector cells are created.

In view of the observations of the critical role of cathepsin cleavage in presentation of CLIP peptides, described in Example 5 above, an experiment was designed to further examine the role of cathepsin in epitope definition.

An interesting feature of the peptide identified, RYCINSASL (SEQ ID NO: 7), is that it is derived from the active site of a metabolic enzyme widely distributed in nature. Mice contain the identical 9-mer peptide in their mitochondrial version of the enzyme. Thus, it would be expected that the mice would recognize the RYCINSASL (SEQ ID NO: 7) peptide as “self” and not produce an immunological response. Nevertheless, mice infected with B. melitensis produce a profound immunological response to this peptide; it is not recognized as self. The flanking residues outside of the active site 9-mer are quite different between the murine endogenous and B. melitensis forms of the enzyme. The differences in amino acids in the flanking positions change the probability of the N- and C-terminal bonds being cleaved. In contrast to the peptide from B. melitensis, the peptide in the mouse mitochondrial enzyme is not predicted to be excised (FIG. 23).

In order to test this experimentally, six amino acids on each of the N- and C-terminal sides of the RL9 peptide in the Brucella enzyme were replaced to make it non-cleavable. This is shown in FIG. 24.
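The flank replacement can be illustrated with a short sketch. The helper name `replace_flanks` is hypothetical and this is not the authors' design tool; the wild-type 31-mer, the RL9 core and the murine flank residues are copied from the amino acid sequences listed at the end of this example (Seq. Id Nos. 2 and 4).

```python
WILD_TYPE = "FPDGPVDRGGLRYCINSASLRFVPKDRMEAE"  # Seq. Id No. 2, Brucella RL(105-135)
RL9 = "RYCINSASL"                              # SEQ ID NO: 7
MURINE_N_FLANK = "PRPTGK"                      # positions 6-11 of Seq. Id No. 4
MURINE_C_FLANK = "SFTPAD"                      # positions 21-26 of Seq. Id No. 4


def replace_flanks(seq, core, n_flank, c_flank):
    """Swap the residues immediately flanking `core` in `seq`.

    The len(n_flank) residues before the core and the len(c_flank)
    residues after it are replaced; everything else is kept unchanged.
    """
    start = seq.index(core)          # raises ValueError if the core is absent
    end = start + len(core)
    if start < len(n_flank) or len(seq) - end < len(c_flank):
        raise ValueError("sequence too short to hold the requested flanks")
    return (
        seq[: start - len(n_flank)]  # untouched N-terminal Brucella residues
        + n_flank
        + core
        + c_flank
        + seq[end + len(c_flank):]   # untouched C-terminal Brucella residues
    )


modified = replace_flanks(WILD_TYPE, RL9, MURINE_N_FLANK, MURINE_C_FLANK)
```

Under these inputs the result reproduces the modified sequence given as Seq. Id No. 4, with the RL9 core and the outer Brucella residues unchanged.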

Cloning of Brucella RL(105-135) Peptides into mG2a Carrier

The existing wild-type Brucella RL(105-135) peptide was cloned into p500695. The wild-type Brucella amino acid sequence contains cathepsin S cleavage sites upstream of the RL9 peptide, as shown in SEQ 1 and 2.

The modified Brucella RL(105-135)mod peptide was cloned into the mG2a carrier. This sequence has the RL9 flanking regions from Brucella replaced with murine flanking regions that are predicted to have no cathepsin S cleavage sites. The two flanking regions are marked in SEQ 3 and 4 and in FIG. 8.

Cloning procedure: The amino acid sequence encoding Brucella melitensis methionine sulfoxide reductase (Accession #NP 541797), positions aa 105-135, was backtranslated using the Lasergene software (DNAstar, Madison, Wis.) built-in mammalian non-degenerate backtranslation code. Proper restriction enzyme sites were added to both ends of the RL(105-135) sequence and the nucleotide sequence was synthesized using a commercial vendor (IDT, Coralville, Iowa). The sequence for the modified RL(105-135)mod was similarly assembled in silico and then submitted for synthesis. The synthesized RL(105-135) gene sequences are digested at the specific restriction sites and cloned in-frame upstream of the murine G2a (hinge-CH2-CH3)-containing expression retrovector.

In vivo testing: The expression retrovectors containing the RL(105-135) or RL(105-135)mod sequence were used to make stable CHO expression cell lines to produce both peptides as N-terminal fusions to the murine IgG2a hinge-Fc portion. BR-RL(105-135)-mG2a and BR-RL(105-135)mod-mG2a are harvested from cell supernatant and used to immunize mice via subcutaneous injection at the tail at a dose of 25 µg/mouse, formulated with Sigma (S6322) and CpG adjuvants. One or two boosts are given after the first injection. One week after the last boost, splenocytes are collected from immunized mice and cultivated in vitro. Splenocytes from naïve mice are pulsed with synthesized RL9 or irrelevant control peptide and then added to the harvested effector cells. After a 5 h incubation, cells are harvested and monitored for T-cell phenotype (CD4, CD8, CD3), activation status (LFA-1) and intracellular cytokine (IFN-γ) production using flow cytometry. This analysis will yield information as to whether the removal of a predicted cathepsin cleavage site changes processing of peptides upon uptake by antigen presenting cells and subsequent stimulation of T-cells.

It is anticipated that mice vaccinated with the modified peptide sequence will not display the peptide on the MHC I surface molecules nor generate a T-cell response.

BR-RL(105-135)-G2a(CH2-CH3)-BR-RL, nucleotide sequence, ID: 500695n (Seq. Id No. 1)
Sequence (1-93): TTCCCCGACGGCCCCGTGGACCGCGGCGGCCTGCGCTACTGCATCAACTCCGCCTCCCTGCGCTTCGTGCCCAAGGACCGCATGGAGGCCGAG
1-93 BR-RL(105-135)

BR-RL(105-135)-G2a(CH2-CH3)-BR-RL, amino acid sequence, ID: 500695p (Seq. Id No. 2)
Sequence (1-31): FPDGPVDRGGLRYCINSASLRFVPKDRMEAE
1-31 BR-RL(105-135)

BR-RL(105-135)mod-G2a(CH2-CH3)-BR-RL, nucleotide sequence, ID: 500695n (Seq. Id No. 3)
Sequence (1-93): TTCCCCGACGGCCCTCCTCGTCCGACCGGCAAAAGATACTGCATCAACTCAGCATCCTTGTCCTTCACTCCTGCAGACCGCATGGAGGCCGAG
1-15 Brucella sequence
16-33 Murine sequence
34-60 Brucella RL9 peptide
61-78 Murine sequence
79-93 Brucella sequence

BR-RL(105-135)mod-G2a(CH2-CH3)-BR-RL, amino acid sequence, ID: 500695p (Seq. Id No. 4)
Sequence (1-31): FPDGPPRPTGKRYCINSASLSFTPADRMEAE
1-5 Brucella sequence
6-11 Murine sequence
12-20 Brucella RL9 peptide
21-26 Murine sequence
27-31 Brucella sequence
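As a consistency check on the four sequences above, the nucleotide sequences can be translated with the standard genetic code and the two amino acid sequences compared position by position. This is a sketch for the reader, not part of the cloning procedure; the names `translate` and `changed_positions` are hypothetical, and the use of the standard codon table is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SEQ1_NT = ("TTCCCCGACGGCCCCGTGGACCGCGGCGGCCTGCGCTACTGCATCAACTCCGCCTCC"
           "CTGCGCTTCGTGCCCAAGGACCGCATGGAGGCCGAG")
SEQ2_AA = "FPDGPVDRGGLRYCINSASLRFVPKDRMEAE"
SEQ3_NT = ("TTCCCCGACGGCCCTCCTCGTCCGACCGGCAAAAGATACTGCATCAACTCAGCATCC"
           "TTGTCCTTCACTCCTGCAGACCGCATGGAGGCCGAG")
SEQ4_AA = "FPDGPPRPTGKRYCINSASLSFTPADRMEAE"


def translate(nt):
    """Translate an in-frame DNA sequence codon by codon."""
    if len(nt) % 3:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))


def changed_positions(a, b):
    """Return 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


wt_protein = translate(SEQ1_NT)
mod_protein = translate(SEQ3_NT)
substituted = changed_positions(wt_protein, mod_protein)
```

With these inputs the translations match Seq. Id Nos. 2 and 4, and every substituted position falls inside the annotated murine regions (6-11 and 21-26). Some residues within those regions are the same in both sequences, so fewer than twelve positions differ.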

REFERENCE LIST

1. Kleifeld, O. et al. Isotopic labeling of terminal amines in complex samples identifies protein N-termini and protease cleavage products. Nat. Biotechnol. 28, 281-288 (2010).
2. Doucet, A., Butler, G. S., Rodriguez, D., Prudova, A., & Overall, C. M. Metadegradomics: toward in vivo quantitative degradomics of proteolytic post-translational modifications of the cancer proteome. Mol. Cell Proteomics. 7, 1925-1951 (2008).
3. auf dem Keller, U. & Schilling, O. Proteomic techniques and activity-based probes for the system-wide study of proteolysis. Biochimie 92, 1705-1714 (2010).
4. Impens, F. et al. MS-driven protease substrate degradomics. Proteomics. 10, 1284-1296 (2010).
5. Agard, N. J. & Wells, J. A. Methods for the proteomic identification of protease substrates. Curr. Opin. Chem. Biol. 13, 503-509 (2009).
6. Schilling, O. & Overall, C. M. Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nat. Biotechnol. 26, 685-694 (2008).
7. Biniossek, M. L., Nagler, D. K., Becker-Pauly, C., & Schilling, O. Proteomic identification of protease cleavage sites characterizes prime and non-prime specificity of cysteine cathepsins B, L, and S. J. Proteome. Res. 10, 5363-5373 (2011).
8. Impens, F. et al. A quantitative proteomics design for systematic identification of protease cleavage events. Mol. Cell Proteomics. 9, 2327-2333 (2010).
9. Tholen, S. et al. Contribution of cathepsin L to secretome composition and cleavage pattern of mouse embryonic fibroblasts. Biol. Chem. 392, 961-971 (2011).
10. Schechter, I. & Berger, A. On the size of the active site in proteases. I. Papain. Biochem. Biophys. Res. Commun. 27, 157-162 (1967).
11. Ng, N. M. et al. The effects of exosite occupancy on the substrate specificity of thrombin. Arch. Biochem. Biophys. 489, 48-54 (2009).
12. Boyd, S. E., Pike, R. N., Rudy, G. B., Whisstock, J. C., & Garcia de la Banda, M. PoPS: a computational tool for modeling and predicting protease specificity. J. Bioinform. Comput. Biol. 3, 551-585 (2005).
13. Shen, H. B. & Chou, K. C. Identification of proteases and their types. Anal. Biochem. 385, 153-160 (2009).
14. Chou, K. C. & Shen, H. B. ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information. Biochem. Biophys. Res. Commun. 376, 321-325 (2008).
15. Lohmuller, T. et al. Toward computer-based cleavage site prediction of cysteine endopeptidases. Biol. Chem. 384, 899-909 (2003).
16. Song, J. et al. Bioinformatic approaches for predicting substrates of proteases. J. Bioinform. Comput. Biol. 9, 149-178 (2011).
17. Yang, Z. R. Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics. 21, 1831-1837 (2005).
18. Rognvaldsson, T. et al. How to find simple and accurate rules for viral protease cleavage specificities. BMC. Bioinformatics. 10, 149 (2009).
19. Rognvaldsson, T. & You, L. Why neural networks should not be used for HIV-1 protease cleavage site prediction. Bioinformatics. 20, 1702-1709 (2004).
20. El-Manzalawy, Y., Dobbs, D., & Honavar, V. On evaluating MHC-II binding peptide prediction methods. PLoS. One. 3, e3268 (2008).
21. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis I: Dimensional reduction, visualization and prediction of MHC binding using amino acid principal components and regression approaches. Immunome. Res. 6, 7 (2010).
22. Bremel, R. D. & Homan, E. J. An integrated approach to epitope analysis II: A system for proteomic-scale prediction of immunological characteristics. Immunome. Res. 6, 8 (2010).
23. Sjostrom, M. et al. Peptide QSARS: PLS modelling and design in principal properties. Prog. Clin. Biol. Res. 291, 313-317 (1989).
24. Hellberg, S., Sjostrom, M., Skagerberg, B., & Wold, S. Peptide quantitative structure-activity relationships, a multivariate approach. J. Med. Chem. 30, 1126-1135 (1987).
25. Linusson, A., Elofsson, M., Andersson, I. E., & Dahlgren, M. K. Statistical molecular design of balanced compound libraries for QSAR modeling. Curr. Med. Chem. 17, 2001-2016 (2010).
26. Linusson, A., Gottfries, J., Lindgren, F., & Wold, S. Statistical molecular design of building blocks for combinatorial chemistry. J. Med. Chem. 43, 1320-1328 (2000).
27. Du, Q. S., Huang, R. B., & Chou, K. C. Recent advances in QSAR and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design. Curr. Protein Pept. Sci. 9, 248-260 (2008).
28. Bishop, C. M. Neural Networks for Pattern Recognition (Oxford University Press, Oxford, 1995).
29. Turk, V. et al. Cysteine cathepsins: from structure, function and regulation to new frontiers. Biochim. Biophys. Acta 1824, 68-88 (2012).
30. Duda, R. O., Hart, P. E., & Stork, D. G. Pattern Classification (John Wiley & Sons, Inc., 2001).
31. Beck, H. et al. Cathepsin S and an asparagine-specific endoprotease dominate the proteolytic processing of human myelin basic protein in vitro. Eur. J. Immunol. 31, 3726-3736 (2001).
32. Du, Q. S., Wei, Y. T., Pang, Z. W., Chou, K. C., & Huang, R. B. Predicting the affinity of epitope-peptides with class I MHC molecule HLA-A*0201: an application of amino acid-based peptide prediction. Protein Eng Des Sel 20, 417-423 (2007).
33. Choo, K. H., Tan, T. W., & Ranganathan, S. A comprehensive assessment of N-terminal signal peptides prediction methods. BMC. Bioinformatics. 10 Suppl 15, S2 (2009).
34. Colaert, N., Helsens, K., Martens, L., Vandekerckhove, J., & Gevaert, K. Improved visualization of protein consensus sequences by iceLogo. Nat. Methods 6, 786-787 (2009).
35. Rigaut, K. D., Birk, D. E., & Lenard, J. Intracellular distribution of input vesicular stomatitis virus proteins after uncoating. J. Virol. 65, 2622-2628 (1991).
36. Schilling, O., auf dem Keller, U., & Overall, C. M. Factor Xa subsite mapping by proteome-derived peptide libraries improved using WebPICS, a resource for proteomic identification of cleavage sites. Biol. Chem. 392, 1031-1037 (2011).
37. Chou, K. C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236-247 (2011).
38. Chawla, N., Lazarevic, A., Hall, L., & Bowyer, K. SMOTEBoost: Improving prediction of the minority class in boosting. Knowledge Discovery in Databases: PKDD 2003, 107-119 (2003).
39. Chawla, N., Eschrich, S., & Hall, L. O. Creating ensembles of classifiers. Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, 580-581. 2001. IEEE.
40. Chawla, N. V. Data mining for imbalanced datasets: An overview. Data Mining and Knowledge Discovery Handbook 875-886 (2010).
41. Cieslak, D. A. & Chawla, N. V. Start globally, optimize locally, predict globally: Improving performance on imbalanced data. Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, 143-152. 2008. IEEE.
42. Lichtenwalter, R. & Chawla, N. Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. New Frontiers in Applied Data Mining 53-75 (2010).
43. Tang, Y., Zhang, Y. Q., Chawla, N. V., & Krasser, S. SVMs modeling for highly imbalanced classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 39, 281-288 (2009).
44. Yu, B., Fonseca, D. P., O'Rourke, S. M., & Berman, P. W. Protease cleavage sites in HIV-1 gp120 recognized by antigen processing enzymes are conserved and located at receptor binding sites. J. Virol. 84, 1513-1526 (2010).
45. Beck, H. et al. Cathepsin S and an asparagine-specific endoprotease dominate the proteolytic processing of human myelin basic protein in vitro. Eur. J. Immunol. 31, 3726-3736 (2001).
46. Rawlings, N. D., Barrett, A. J., & Bateman, A. MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 40, D343-D350 (2012).

1. A computer implemented process for providing desired peptidase cleavage sites in a synthetic polypeptide comprising:
(a) obtaining an amino acid sequence for a target polypeptide;
(b) deriving an ensemble of peptidase cleavage prediction equations for each possible cleavage site dimer by:
(i) assembling experimentally derived data comprising a multiplicity of measurements of amino acid physicochemical properties;
(ii) producing a correlation matrix of the experimentally derived data;
(iii) deriving by Principal Component Analysis multiple uncorrelated dimensionless, weighted and ranked proxy descriptors to describe at least 80% of the variance in said physicochemical properties of individual amino acids,
(iv) using said proxy descriptors to describe individual amino acids in a set of peptides each of which comprises a specific cleavage site dimer experimentally determined to be cleaved, and to describe the individual amino acids in a set of peptides each of which comprises the same cleavage site dimer experimentally determined to be uncleaved;
(v) comparing, via probabilistic modeling, the amino acid descriptors of said peptides comprising said experimentally determined cleaved and uncleaved cleavage site dimers to derive a cleavage prediction equation for said specific cleavage site dimer based on that peptide set;
(vi) repeating the steps of (iv) and (v) multiple times, each time with a different set of peptides that comprise the same cleavage site dimer and are experimentally determined to be cleaved or not cleaved, thereby deriving an ensemble of independently derived peptidase cleavage prediction equations for said specific cleavage site dimer;
(vii) repeating the process of (iv) to (vi) to derive ensembles of independently derived prediction equations for every possible cleavage site dimer up to a total of 400 such ensembles;
(viii) storing said ensembles of independently derived prediction equations on a non-transitory computer readable medium;
(c) inputting said amino acid sequence from said target polypeptide into a computer;
(d) applying said proxy descriptors from said Principal Component Analysis to describe individual amino acids in said target polypeptide sequence;
(e) deriving vectors to describe a plurality of peptides of defined length in said target polypeptide;
(f) via said computer processor applying said up to 400 ensembles of independently derived peptidase cleavage prediction equations to said plurality of peptides of defined length from said target polypeptide to predict a plurality of peptidase cleavage sites in said target polypeptide;
(g) identifying, by consensus of the ensembles of prediction equations, the probability of cleavage at any given cleavage site dimer in said target polypeptide;
(h) defining amino acids subsets in said target polypeptide that are predicted to be cleaved by a peptidase;
(i) substituting one or more amino acids in the target polypeptide to change the probability of peptidase cleavage;
(j) synthesizing a synthetic polypeptide comprising the substituted amino acids; and
(k) administering said synthetic polypeptide to a subject.

2. The process of claim 1, wherein said probabilistic modeling is by a probabilistic neural net or a support vector machine.

3. The process of claim 2, wherein said probabilistic neural net comprises a multi-layer perceptron neural network regression process wherein the output is the probability of cleavage by a particular peptidase within a particular cleavage site dimer within a particular amino acid sequence.

4. The process of claim 3, wherein said probabilistic neural net predicts a peptidase cleavage site with greater than about 70, 80, 90, or 95% accuracy.

5. The process of claim 4, further comprising utilizing a number of hidden nodes in said multi-layer perceptron that correlates to the number of amino acids in the cleavage site octomer.

6. The process of claim 1, wherein said amino acid subset in said protein of interest is from about 4 to about 50 amino acids in length.

7. The process of claim 1, wherein said peptide of defined length is 8 amino acids in length.

8. The process of claim 1, wherein said subsets of amino acid sequences begin at the n-terminus of the amino acid sequence, wherein n is the first amino acid of the sequence and c is the last amino acid in the sequence, and the sets comprise each peptide of 8 amino acids in length starting from n and the next peptide in the set is n+1 until n+1 ends at c for the given length of the peptides selected.

9. The process of claim 1, wherein said physicochemical properties of individual amino acids are selected from the group consisting of polarity, optimized matching hydrophobicity, hydropathicity, hydropathicity expressed as free energy of transfer to surface in kcal/mole, hydrophobicity scale based on free energy of transfer in kcal/mole, hydrophobicity expressed as ΔG ½ cal, hydrophobicity scale derived from 3D data, hydrophobicity scale represented as π−r, molar fraction of buried residues, proportion of residues 95% buried, free energy of transfer from inside to outside of a globular protein, hydration potential in kcal/mol, membrane buried helix parameter, mean fractional area loss, average area buried on transfer from standard state to folded protein, molar fraction of accessible residues, hydrophilicity, normalized consensus hydrophobicity scale, average surrounding hydrophobicity, hydrophobicity of physiological L-amino acids, hydrophobicity scale represented as (π−r)², retention coefficient in HFBA, retention coefficient in HPLC pH 2.1, hydrophobicity scale derived from HPLC peptide retention times, hydrophobicity indices at pH 7.5 determined by HPLC, retention coefficient in TFA, retention coefficient in HPLC pH 7.4, hydrophobicity indices at pH 3.4 determined by HPLC, mobilities of amino acids on chromatography paper, hydrophobic constants derived from HPLC peptide retention times, and combinations thereof.

10. The process of claim 1, wherein said contribution of the physical properties of each amino acid in said subsets to the peptidase cleavage site is weighted according to the amino acid and its position in each peptide in said set of peptides.

11. The process of claim 1, wherein said process is applied sequentially to determine cleavage by two or more peptidases.

12. The process of claim 1, wherein said peptidase is a eukaryotic, prokaryotic, viral or synthetic peptidase.

13. The process of claim 1, wherein said peptidase is an endopeptidase drawn from the group comprising a serine peptidase, a cysteine peptidase, an aspartic peptidase, a glutamic peptidase, an asparagine peptidase, a threonine peptidase, or a metallo-peptidase.

14. The process of claim 1, wherein said peptidase is a cathepsin.

15. The process of claim 1 wherein said change in the probability of peptidase cleavage is an increased probability of cleavage.

16. The process of claim 1 wherein said change in the probability of peptidase cleavage is a decreased probability of cleavage.

17. The process of claim 1 wherein said synthetic polypeptide is a biopharmaceutical polypeptide or protein.

18. The process of claim 17 wherein said biopharmaceutical polypeptide or protein is drawn from the group comprising enzymes, clotting factors, monoclonal antibodies and antibody fusions.

19. The process of claim 17 wherein said biopharmaceutical polypeptide or protein is an immunogen.

20. The process of claim 1 wherein said target polypeptide is an allergen.

21. The process of claim 1 wherein said target polypeptide elicits an autoimmune reaction.

22. The process of claim 1 further comprising: determining if the target polypeptide comprises a predicted B cell epitope or a predicted T cell epitope peptide and substituting said one or more amino acids in the target polypeptide to change the probability of peptidase cleavage to cleave said predicted B or T cell epitope peptide, thereby removing its immunogenicity.

23. The process of claim 1, wherein said peptidase is in an antigen presenting cell.

24. The process of claim 23, wherein said peptidase is secreted.

25. The process of claim 23, wherein said peptidase is membrane associated.

26. The process of claim 23, wherein said peptidase functions in digestion.

27. The process of claim 23, wherein said peptidase functions in cell signaling.

28. The process of claim 23, wherein said peptidase functions in immune processing.

29. A host cell comprising a nucleic acid encoding a polypeptide identified in step (i) of claim 1.